Preface


The seeds for this book were first planted in 2001 when Steve Seitz at the University of Wash- 
ington invited me to co-teach a course called “Computer Vision for Computer Graphics”. At 
that time, computer vision techniques were increasingly being used in computer graphics to 
create image-based models of real-world objects, to create visual effects, and to merge real- 
world imagery using computational photography techniques. Our decision to focus on the 
applications of computer vision to fun problems such as image stitching and photo-based 3D
modeling from personal photos seemed to resonate well with our students.


That initial course evolved into a more complete computer vision syllabus and project- 
oriented course structure that I used to co-teach general computer vision courses both at the 
University of Washington and at Stanford. (The latter was a course I co-taught with David 
Fleet in 2003.) Similar curricula were then adopted at a number of other universities and also 
incorporated into more specialized courses on computational photography. (For ideas on how
to use this book in your own course, please see Table 1.1 in Section 1.4.)


This book also reflects my 40 years’ experience doing computer vision research in corporate
research labs, mostly at Digital Equipment Corporation’s Cambridge Research Lab,
Microsoft Research, and Facebook. In pursuing my work, I have mostly focused on problems 
and solution techniques (algorithms) that have practical real-world applications and that work 
well in practice. Thus, this book has more emphasis on basic techniques that work under real- 
world conditions and less on more esoteric mathematics that has intrinsic elegance but less 
practical applicability. 

This book is suitable for teaching a senior-level undergraduate course in computer vision 
to students in both computer science and electrical engineering. I prefer students to have 
either an image processing or a computer graphics course as a prerequisite, so that they can 
spend less time learning general background mathematics and more time studying computer 
vision techniques. The book is also suitable for teaching graduate-level courses in computer
vision, e.g., by delving into more specialized topics, and as a general reference to fundamental
techniques and the recent research literature. To this end, I have attempted wherever possible
to at least cite the newest research in each sub-field, even if the technical details are too 
complex to cover in the book itself. 

In teaching our courses, we have found it useful for the students to attempt a number of 
small implementation projects, which often build on one another, in order to get them used to 
working with real-world images and the challenges that these present. The students are then 
asked to choose an individual topic for each of their small-group, final projects. (Sometimes 
these projects even turn into conference papers!) The exercises at the end of each chapter 
contain numerous suggestions for smaller mid-term projects, as well as more open-ended 
problems whose solutions are still active research topics. Wherever possible, I encourage 
students to try their algorithms on their own personal photographs, since this better motivates 
them, often leads to creative variants on the problems, and better acquaints them with the 
variety and complexity of real-world imagery. 

In formulating and solving computer vision problems, I have often found it useful to draw
inspiration from four high-level approaches:


• Scientific: build detailed models of the image formation process and develop mathematical
techniques to invert these in order to recover the quantities of interest (where
necessary, making simplifying assumptions to make the mathematics more tractable).

• Statistical: use probabilistic models to quantify the prior likelihood of your unknowns
and the noisy measurement processes that produce the input images, then infer the best
possible estimates of your desired quantities and analyze their resulting uncertainties.
The inference algorithms used are often closely related to the optimization techniques
used to invert the (scientific) image formation processes.

• Engineering: develop techniques that are simple to describe and implement but that
are also known to work well in practice. Test these techniques to understand their
limitations and failure modes, as well as their expected computational costs (run-time
performance).

• Data-driven: collect a representative set of test data (ideally, with labels or ground-truth
answers) and use these data to either tune or learn your model parameters, or at
least to validate and quantify its performance.


These four approaches build on each other and are used throughout the book.

My personal research and development philosophy (and hence the exercises in the book)
has a strong emphasis on testing algorithms. It’s too easy in computer vision to develop an
algorithm that does something plausible on a few images rather than something correct. The
best way to validate your algorithms is to use a three-part strategy.

First, test your algorithm on clean synthetic data, for which the exact results are known. 
Second, add noise to the data and evaluate how the performance degrades as a function of 
noise level. Finally, test the algorithm on real-world data, preferably drawn from a wide 
variety of sources, such as photos found on the web. Only then can you truly know if your 
algorithm can deal with real-world complexity, i.e., images that do not fit some simplified 
model or assumptions. 
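
The first two steps of this strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "algorithm" under test is an ordinary least-squares line fit, standing in for any vision algorithm, and the helper names fit_line, make_synthetic, and slope_error are hypothetical, not taken from this book or from any library. The third step (real-world data) cannot be simulated and is left as a comment.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def make_synthetic(a, b, n=50, noise_sigma=0.0, seed=0):
    """Sample n points on y = a*x + b for x in [0, 1], adding Gaussian noise to y."""
    rng = random.Random(seed)
    xs = (i / (n - 1) for i in range(n))
    return [(x, a * x + b + rng.gauss(0.0, noise_sigma)) for x in xs]


def slope_error(noise_sigma, trials=200, true_a=2.0, true_b=-1.0):
    """Mean absolute error of the recovered slope over several noisy trials."""
    total = 0.0
    for t in range(trials):
        data = make_synthetic(true_a, true_b, noise_sigma=noise_sigma, seed=t)
        a, _ = fit_line(data)
        total += abs(a - true_a)
    return total / trials


# Step 1: clean synthetic data, for which the exact answer (a=2, b=-1) is known.
clean_a, clean_b = fit_line(make_synthetic(2.0, -1.0))

# Step 2: add noise and measure how the error grows with the noise level.
noise_levels = [0.0, 0.01, 0.1, 1.0]
errors = [slope_error(sigma) for sigma in noise_levels]
for sigma, err in zip(noise_levels, errors):
    print(f"sigma={sigma:<5} mean |slope error|={err:.5f}")

# Step 3 (not shown): run the same algorithm on real-world data from varied sources.
```

Plotting or tabulating the error against the noise level, as the loop above does, makes it easy to spot an algorithm that is exact on clean data but degrades faster than expected once noise is added.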

In order to help students in this process, Appendix C includes pointers to commonly used 
datasets and software libraries that contain implementations of a wide variety of computer 
vision algorithms, which can enable you to tackle more ambitious projects (with your
instructor’s consent).


Notes on the Second Edition 


The last decade has seen a truly dramatic explosion in the performance and applicability of 
computer vision algorithms, much of it engendered by the application of machine learning 
algorithms to large amounts of visual training data (Su and Crandall 2021). 

Deep neural networks now play an essential role in so many vision algorithms that the 
new edition of this book introduces them early on as a fundamental technique that gets used 
extensively in subsequent chapters. 


The most notable changes in the second edition include: 


• Machine learning, deep learning, and deep neural networks are introduced early on in
Chapter 5, as they play just as fundamental a role in vision algorithms as more classical
techniques, such as image processing, graphical/probabilistic models, and energy
minimization, which are introduced in the preceding two chapters.

• The recognition chapter has been moved earlier in the book to Chapter 6, since end-to-end
deep learning systems no longer require the development of building blocks such
as feature detection, matching, and segmentation. Many of the students taking vision
classes are primarily interested in visual recognition, so presenting this material earlier
in the course makes it easier for students to base their final project on these topics.
This chapter also includes sections on semantic segmentation, video understanding,
and vision and language.

• The application of neural networks and deep learning to myriad computer vision
algorithms and applications, including flow and stereo, 3D shape modeling, and newly
emerging fields such as neural rendering.

• New technologies such as SLAM (simultaneous localization and mapping) and VIO
(visual inertial odometry) that now run reliably and are used in real-time applications
such as augmented reality and autonomous navigation.


In addition to these larger changes, the book has been updated to reflect the latest
state-of-the-art techniques such as internet-scale image search and phone-based computational
photography. The new edition includes over 1500 new citations (papers) and has over 200 new
figures.
Chapter 1


Introduction 


1.1 What is computer vision?
1.2 A brief history
1.3 Book overview
1.4 Sample syllabus
1.5 A note on notation
1.6 Additional reading


Figure 1.1 The human visual system has no problem interpreting the subtle variations in 
translucency and shading in this photograph and correctly segmenting the object from its 
background. 


Figure 1.2 Some examples of computer vision algorithms and applications. (a) Face detection
algorithms, coupled with color-based clothing and hair detection algorithms, can
locate and recognize the individuals in this image (Sivic, Zitnick, and Szeliski 2006) © 2006
Springer. (b) Object instance segmentation can delineate each person and object in a complex
scene (He, Gkioxari et al. 2017) © 2017 IEEE. (c) Structure from motion algorithms
can reconstruct a sparse 3D point model of a large complex scene from hundreds of partially
overlapping photographs (Snavely, Seitz, and Szeliski 2006) © 2006 ACM. (d) Stereo
matching algorithms can build a detailed 3D model of a building facade from hundreds of
differently exposed photographs taken from the internet (Goesele, Snavely et al. 2007) © 2007
IEEE.


1.1 What is computer vision? 


As humans, we perceive the three-dimensional structure of the world around us with appar- 
ent ease. Think of how vivid the three-dimensional percept is when you look at a vase of 
flowers sitting on the table next to you. You can tell the shape and translucency of each petal 
through the subtle patterns of light and shading that play across its surface and effortlessly 
segment each flower from the background of the scene (Figure 1.1). Looking at a framed 
group portrait, you can easily count and name all of the people in the picture and even guess 
at their emotions from their facial expressions (Figure 1.2a). Perceptual psychologists have 
spent decades trying to understand how the visual system works and, even though they can 
devise optical illusions¹ to tease apart some of its principles (Figure 1.3), a complete solution
to this puzzle remains elusive (Marr 1982; Wandell 1995; Palmer 1999; Livingstone 2008; 
Frisby and Stone 2010). 

Researchers in computer vision have been developing, in parallel, mathematical tech- 
niques for recovering the three-dimensional shape and appearance of objects in imagery. 
Here, the progress in the last two decades has been rapid. We now have reliable techniques for 
accurately computing a 3D model of an environment from thousands of partially overlapping 
photographs (Figure 1.2c). Given a large enough set of views of a particular object or facade, 
we can create accurate dense 3D surface models using stereo matching (Figure 1.2d). We can 
even, with moderate success, delineate most of the people and objects in a photograph
(Figure 1.2b). However, despite all of these advances, the dream of having a computer explain an
image at the same level of detail and causality as a two-year-old remains elusive.

Why is vision so difficult? In part, it is because it is an inverse problem, in which we seek 
to recover some unknowns given insufficient information to fully specify the solution. We 
must therefore resort to physics-based and probabilistic models, or machine learning from 
large sets of examples, to disambiguate between potential solutions. However, modeling the 
visual world in all of its rich complexity is far more difficult than, say, modeling the vocal 
tract that produces spoken sounds. 
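
A toy example helps show why the information in an image is insufficient. The sketch below assumes an idealized pinhole camera, a simplification chosen here for illustration, and a hypothetical helper named project. A small nearby point and a point four times farther away along the same viewing ray land on exactly the same image location, so the image alone cannot tell them apart.

```python
def project(point, focal_length=1.0):
    """Idealized pinhole projection of a 3D point (X, Y, Z), Z > 0, to image (x, y)."""
    X, Y, Z = point
    return (focal_length * X / Z, focal_length * Y / Z)


# Two different 3D points on the same ray through the camera center.
near_point = (0.5, 0.25, 2.0)
far_point = tuple(4.0 * c for c in near_point)  # (2.0, 1.0, 8.0)

print(project(near_point))
print(project(far_point))

# Depth is lost in the forward (projection) process, so inverting it needs
# extra constraints: more views, physical models, priors, or learned examples.
```

Any scaled copy of the scene produces the same image in this model, which is one concrete sense in which the inverse problem is underdetermined.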

The forward models that we use in computer vision are usually developed in physics (ra- 
diometry, optics, and sensor design) and in computer graphics. Both of these fields model 
how objects move and animate, how light reflects off their surfaces, is scattered by the atmo- 
sphere, refracted through camera lenses (or human eyes), and finally projected onto a flat (or 
curved) image plane. While computer graphics are not yet perfect, in many domains, such
as rendering a still scene composed of everyday objects or animating extinct creatures such
as dinosaurs, the illusion of reality is essentially there.

¹ Some fun pages with striking illusions include https://michaelbach.de/ot, https://www.illusionsindex.org, and
http://www.ritsumei.ac.jp/~akitaoka/index-e.html.

Figure 1.3 Some common optical illusions and what they might tell us about the visual
system: (a) The classic Müller-Lyer illusion, where the lengths of the two horizontal lines
appear different, probably due to the imagined perspective effects. (b) The “white” square B
in the shadow and the “black” square A in the light actually have the same absolute intensity
value. The percept is due to brightness constancy, the visual system’s attempt to discount
illumination when interpreting colors. Image courtesy of Ted Adelson,
http://persci.mit.edu/gallery/checkershadow. (c) A variation of the Hermann grid illusion,
courtesy of Hany Farid. As you move your eyes over the figure, gray spots appear at the
intersections. (d) Count the red Xs in the left half of the figure. Now count them in the
right half. Is it significantly harder? The explanation has to do with a pop-out effect
(Treisman 1985), which tells us about the operations of parallel perception and integration
pathways in the brain.

In computer vision, we are trying to do the inverse, i.e., to describe the world that we 
see in one or more images and to reconstruct its properties, such as shape, illumination, 
and color distributions. It is amazing that humans and animals do this so effortlessly, while 
computer vision algorithms are so error prone. People who have not worked in the field often 
underestimate the difficulty of the problem. This misperception that vision should be easy 
dates back to the early days of artificial intelligence (see Section 1.2), when it was initially 
believed that the cognitive (logic proving and planning) parts of intelligence were intrinsically 
more difficult than the perceptual components (Boden 2006). 

The good news is that computer vision is being used today in a wide variety of real-world
applications, which include:


• Optical character recognition (OCR): reading handwritten postal codes on letters
(Figure 1.4a) and automatic number plate recognition (ANPR);

• Machine inspection: rapid parts inspection for quality assurance using stereo vision
with specialized illumination to measure tolerances on aircraft wings or auto body parts
(Figure 1.4b) or looking for defects in steel castings using X-ray vision;

• Retail: object recognition for automated checkout lanes and fully automated stores
(Wingfield 2019);

• Warehouse logistics: autonomous package delivery and pallet-carrying “drives” (Guizzo
2008; O’Brian 2019) and parts picking by robotic manipulators (Figure 1.4c; Ackerman
2020);

• Medical imaging: registering pre-operative and intra-operative imagery (Figure 1.4d)
or performing long-term studies of people’s brain morphology as they age;

• Self-driving vehicles: capable of driving point-to-point between cities (Figure 1.4e;
Montemerlo, Becker et al. 2008; Urmson, Anhalt et al. 2008; Janai, Güney et al. 2020)
as well as autonomous flight (Kaufmann, Gehrig et al. 2019);

• 3D model building (photogrammetry): fully automated construction of 3D models
from aerial and drone photographs (Figure 1.4f);

• Match move: merging computer-generated imagery (CGI) with live action footage by
tracking feature points in the source video to estimate the 3D camera motion and shape
of the environment. Such techniques are widely used in Hollywood, e.g., in movies
such as Jurassic Park (Roble 1999; Roble and Zafar 2009); they also require the use of
precise matting to insert new elements between foreground and background elements
(Chuang, Agarwala et al. 2002).

• Motion capture (mocap): using retro-reflective markers viewed from multiple cameras
or other vision-based techniques to capture actors for computer animation;

• Surveillance: monitoring for intruders, analyzing highway traffic and monitoring pools
for drowning victims (e.g., https://swimeye.com);

• Fingerprint recognition and biometrics: for automatic access authentication as well
as forensic applications.

Figure 1.4 Some industrial applications of computer vision: (a) optical character
recognition (OCR), http://yann.lecun.com/exdb/lenet; (b) mechanical inspection,
http://www.cognitens.com; (c) warehouse picking, https://covariant.ai; (d) medical
imaging, http://www.clarontech.com; (e) self-driving cars, (Montemerlo, Becker et al.
2008) © 2008 Wiley; (f) drone-based photogrammetry,
https://www.pix4d.com/blog/mapping-chillon-castle-with-drone.


David Lowe’s website of industrial vision applications (http://www.cs.ubc.ca/spider/lowe/vision.html)
lists many other interesting industrial applications of computer vision. While
the above applications are all extremely important, they mostly pertain to fairly specialized 
kinds of imagery and narrow domains. 

In addition to all of these industrial applications, there exist myriad consumer-level ap- 
plications, such as things you can do with your own personal photographs and video. These 
include: 


• Stitching: turning overlapping photos into a single seamlessly stitched panorama
(Figure 1.5a), as described in Section 8.2;

• Exposure bracketing: merging multiple exposures taken under challenging lighting
conditions (strong sunlight and shadows) into a single perfectly exposed image
(Figure 1.5b), as described in Section 10.2;

• Morphing: turning a picture of one of your friends into another, using a seamless
morph transition (Figure 1.5c);

• 3D modeling: converting one or more snapshots into a 3D model of the object or
person you are photographing (Figure 1.5d), as described in Section 13.6;

• Video match move and stabilization: inserting 2D pictures or 3D models into your
videos by automatically tracking nearby reference points (see Section 11.4.4)² or using
motion estimates to remove shake from your videos (see Section 9.2.1);

• Photo-based walkthroughs: navigating a large collection of photographs, such as the
interior of your house, by flying between different photos in 3D (see Sections 14.1.2
and 14.5.5);

• Face detection: for improved camera focusing as well as more relevant image searching
(see Section 6.3.1);

• Visual authentication: automatically logging family members onto your home computer
as they sit down in front of the webcam (see Section 6.2.4).

² For a fun student project on this topic, see the “PhotoBook” project at
http://www.cc.gatech.edu/dvfx/videos/dvfx2005.html.


The great thing about these applications is that they are already familiar to most students; 
they are, at least, technologies that students can immediately appreciate and use with their 
own personal media. Since computer vision is a challenging topic, given the wide range 
of mathematics being covered³ and the intrinsically difficult nature of the problems being
solved, having fun and relevant problems to work on can be highly motivating and inspiring. 

The other major reason why this book has a strong focus on applications is that they can 
be used to formulate and constrain the potentially open-ended problems endemic in vision. 
Thus, it is better to think back from the problem at hand to suitable techniques, rather than to 
grab the first technique that you may have heard of. This kind of working back from problems 
to solutions is typical of an engineering approach to the study of vision and reflects my own 
background in the field. 

First, I come up with a detailed problem definition and decide on the constraints and 
specifications for the problem. Then, I try to find out which techniques are known to work, 
implement a few of these, evaluate their performance, and finally make a selection. In order 
for this process to work, it is important to have realistic test data, both synthetic, which 
can be used to verify correctness and analyze noise sensitivity, and real-world data typical of 
the way the system will finally be used. If machine learning is being used, it is even more 
important to have representative unbiased training data in sufficient quantity to obtain good 
results on real-world inputs. 

However, this book is not just an engineering text (a source of recipes). It also takes a 
scientific approach to basic vision problems. Here, I try to come up with the best possible 
models of the physics of the system at hand: how the scene is created, how light interacts 
with the scene and atmospheric effects, and how the sensors work, including sources of noise 
and uncertainty. The task is then to try to invert the acquisition process to come up with the 
best possible description of the scene. 

³ These techniques include physics, Euclidean and projective geometry, statistics, and optimization. They make
computer vision a fascinating field to study and a great way to learn techniques widely applicable in other fields.

Figure 1.5 Some consumer applications of computer vision: (a) image stitching: merging
different views (Szeliski and Shum 1997) © 1997 ACM; (b) exposure bracketing: merging
different exposures; (c) morphing: blending between two photographs (Gomes, Darsa et
al. 1999) © 1999 Morgan Kaufmann; (d) smartphone augmented reality showing real-time
depth occlusion effects (Valentin, Kowdle et al. 2018) © 2018 ACM.

The book often uses a statistical approach to formulating and solving computer vision
problems. Where appropriate, probability distributions are used to model the scene and the
noisy image acquisition process. The association of prior distributions with unknowns is often
called Bayesian modeling (Appendix B). It is possible to associate a risk or loss function with
misestimating the answer (Section B.2) and to set up your inference algorithm to minimize
the expected risk. (Consider a robot trying to estimate the distance to an obstacle: it is 
usually safer to underestimate than to overestimate.) With statistical techniques, it often helps 
to gather lots of training data from which to learn probabilistic models. Finally, statistical 
approaches enable you to use proven inference techniques to estimate the best answer (or 
distribution of answers) and to quantify the uncertainty in the resulting estimates. 
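
The robot example can be made concrete with a small numerical sketch. Everything specific here is an assumption made for illustration: the discrete posterior over distances, the factor-of-four penalty on overestimates, and the helper names expected_risk, symmetric_loss, asymmetric_loss, and best_estimate are all hypothetical. The sketch picks, from a handful of candidate distances, the one with the lowest expected loss under the posterior.

```python
def expected_risk(estimate, distances, probs, loss):
    """Expected loss of reporting `estimate` under a discrete posterior."""
    return sum(p * loss(estimate, d) for d, p in zip(distances, probs))


def symmetric_loss(estimate, true_distance):
    """Plain absolute error: over- and underestimates cost the same."""
    return abs(estimate - true_distance)


def asymmetric_loss(estimate, true_distance, over_penalty=4.0):
    """Absolute error, with overestimates weighted over_penalty times more."""
    err = estimate - true_distance
    return over_penalty * err if err > 0 else -err


# Hypothetical posterior over the distance to the obstacle, in meters.
distances = [1.0, 1.5, 2.0, 2.5, 3.0]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]


def best_estimate(loss):
    """Candidate distance that minimizes the expected risk for this loss."""
    return min(distances, key=lambda e: expected_risk(e, distances, probs, loss))


sym = best_estimate(symmetric_loss)    # behaves like the posterior median
asym = best_estimate(asymmetric_loss)  # pulled toward shorter distances

print("symmetric loss estimate: ", sym)
print("asymmetric loss estimate:", asym)
```

With the symmetric loss the chosen estimate sits at the center of the posterior, while the asymmetric loss shifts it toward a shorter, more cautious distance, which is the behavior the obstacle example calls for.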

Because so much of computer vision involves the solution of inverse problems or the esti- 
mation of unknown quantities, my book also has a heavy emphasis on algorithms, especially 
those that are known to work well in practice. For many vision problems, it is all too easy to 
come up with a mathematical description of the problem that either does not match realistic 
real-world conditions or does not lend itself to the stable estimation of the unknowns. What 
we need are algorithms that are both robust to noise and deviation from our models and rea- 
sonably efficient in terms of run-time resources and space. In this book, I go into these issues 
in detail, using Bayesian techniques, where applicable, to ensure robustness, and efficient 
search, minimization, and linear system solving algorithms to ensure efficiency.⁴ Most of the
algorithms described in this book are at a high level, being mostly a list of steps that have to 
be filled in by students or by reading more detailed descriptions elsewhere. In fact, many of 
the algorithms are sketched out in the exercises. 

Now that I’ve described the goals of this book and the frameworks that I use, I devote the 
rest of this chapter to two additional topics. Section 1.2 is a brief synopsis of the history of 
computer vision. It can easily be skipped by those who want to get to “the meat” of the new 
material in this book and do not care as much about who invented what when. 

The second is an overview of the book’s contents, Section 1.3, which is useful reading for 
everyone who intends to make a study of this topic (or to jump in partway, since it describes 
chapter interdependencies). This outline is also useful for instructors looking to structure 
one or more courses around this topic, as it provides sample curricula based on the book’s 
contents. 


1.2 A brief history 


In this section, I provide a brief personal synopsis of the main developments in computer vi- 
sion over the last fifty years (Figure 1.6) with a focus on advances I find personally interesting 
and that have stood the test of time. Readers not interested in the provenance of various ideas
and the evolution of this field should skip ahead to the book overview in Section 1.3.

⁴ In some cases, deep neural networks have also been shown to be an effective way to speed up algorithms that
previously relied on iteration (Chen, Xu, and Koltun 2017).
Figure 1.6 A rough timeline of some of the most active topics of research in computer
vision.


1970s. When computer vision first started out in the early 1970s, it was viewed as the 
visual perception component of an ambitious agenda to mimic human intelligence and to 
endow robots with intelligent behavior. At the time, it was believed by some of the early 
pioneers of artificial intelligence and robotics (at places such as MIT, Stanford, and CMU) 
that solving the “visual input” problem would be an easy step along the path to solving more 
difficult problems such as higher-level reasoning and planning. According to one well-known 
story, in 1966, Marvin Minsky at MIT asked his undergraduate student Gerald Jay Sussman 
to “spend the summer linking a camera to a computer and getting the computer to describe 
what it saw” (Boden 2006, p. 781).⁵ We now know that the problem is slightly more difficult
than that.⁶

What distinguished computer vision from the already existing field of digital image pro- 
cessing (Rosenfeld and Pfaltz 1966; Rosenfeld and Kak 1976) was a desire to recover the 
three-dimensional structure of the world from images and to use this as a stepping stone to- 
wards full scene understanding. Winston (1975) and Hanson and Riseman (1978) provide 
two nice collections of classic papers from this early period. 


⁵ Boden (2006) cites (Crevier 1993) as the original source. The actual Vision Memo was authored by Seymour
Papert (1966) and involved a whole cohort of students.

⁶ To see how far robotic vision has come in the last six decades, have a look at some of the videos on the Boston
Dynamics https://www.bostondynamics.com, Skydio https://www.skydio.com, and Covariant https://covariant.ai
websites.

Figure 1.7 Some early (1970s) examples of computer vision algorithms: (a) line labeling
(Nalwa 1993) © 1993 Addison-Wesley, (b) pictorial structures (Fischler and Elschlager 1973)
© 1973 IEEE, (c) articulated body model (Marr 1982) © 1982 David Marr, (d) intrinsic
images (Barrow and Tenenbaum 1981) © 1973 IEEE, (e) stereo correspondence (Marr 1982)
© 1982 David Marr, (f) optical flow (Nagel and Enkelmann 1986) © 1986 IEEE.

Early attempts at scene understanding involved extracting edges and then inferring the
3D structure of an object or a “blocks world” from the topological structure of the 2D lines
(Roberts 1965). Several line labeling algorithms (Figure 1.7a) were developed at that time 
(Huffman 1971; Clowes 1971; Waltz 1975; Rosenfeld, Hummel, and Zucker 1976; Kanade 
1980). Nalwa (1993) gives a nice review of this area. The topic of edge detection was also 
an active area of research; a nice survey of contemporaneous work can be found in (Davis 
1975). 

Three-dimensional modeling of non-polyhedral objects was also being studied (Baumgart
1974; Baker 1977). One popular approach used generalized cylinders, i.e., solids of
revolution and swept closed curves (Agin and Binford 1976; Nevatia and Binford 1977), often
arranged into parts relationships⁷ (Hinton 1977; Marr 1982) (Figure 1.7c). Fischler and
Elschlager (1973) called such elastic arrangements of parts pictorial structures (Figure 1.7b). 

A qualitative approach to understanding intensities and shading variations and explaining 
them by the effects of image formation phenomena, such as surface orientation and shadows, 
was championed by Barrow and Tenenbaum (1981) in their paper on intrinsic images
(Figure 1.7d), along with the related 2½-D sketch ideas of Marr (1982). This approach has seen
periodic revivals, e.g., in the work of Tappen, Freeman, and Adelson (2005) and Barron and
Malik (2012).

⁷ In robotics and computer animation, these linked-part graphs are often called kinematic chains.

More quantitative approaches to computer vision were also developed at the time, in- 
cluding the first of many feature-based stereo correspondence algorithms (Figure 1.7e) (Dev 
1974; Marr and Poggio 1976, 1979; Barnard and Fischler 1982; Ohta and Kanade 1985; 
Grimson 1985; Pollard, Mayhew, and Frisby 1985) and intensity-based optical flow algo- 
rithms (Figure 1.7f) (Horn and Schunck 1981; Huang 1981; Lucas and Kanade 1981; Nagel 
1986). The early work in simultaneously recovering 3D structure and camera motion (see 
Chapter 11) also began around this time (Ullman 1979; Longuet-Higgins 1981). 

A lot of the philosophy of how vision was believed to work at the time is summarized 
in David Marr’s (1982) book.⁸ In particular, Marr introduced his notion of the three levels
of description of a (visual) information processing system. These three levels, very loosely
paraphrased according to my own interpretation, are:


• Computational theory: What is the goal of the computation (task) and what are the
constraints that are known or can be brought to bear on the problem?

• Representations and algorithms: How are the input, output, and intermediate information
represented and which algorithms are used to calculate the desired result?

• Hardware implementation: How are the representations and algorithms mapped onto
actual hardware, e.g., a biological vision system or a specialized piece of silicon? Conversely,
how can hardware constraints be used to guide the choice of representation and
algorithm? With the prevalent use of graphics chips (GPUs) and many-core architectures
for computer vision, this question is again quite relevant.


As I mentioned earlier in this introduction, it is my conviction that a careful analysis of the 
problem specification and known constraints from image formation and priors (the scientific 
and statistical approaches) must be married with efficient and robust algorithms (the engineering
approach) to design successful vision algorithms. Thus, it seems that Marr’s philosophy
is as good a guide to framing and solving problems in our field today as it was 25 years ago.


1980s. In the 1980s, a lot of attention was focused on more sophisticated mathematical 
techniques for performing quantitative image and scene analysis. 

Image pyramids (see Section 3.5) started being widely used to perform tasks such as image
blending (Figure 1.8a) and coarse-to-fine correspondence search (Rosenfeld 1980; Burt
and Adelson 1983b; Rosenfeld 1984; Quam 1984; Anandan 1989). Continuous versions of
pyramids using the concept of scale-space processing were also developed (Witkin 1983;
Witkin, Terzopoulos, and Kass 1986; Lindeberg 1990). In the late 1980s, wavelets (see
Section 3.5.4) started displacing or augmenting regular image pyramids in some applications
(Mallat 1989; Simoncelli and Adelson 1990a; Simoncelli, Freeman et al. 1992).

⁸ More recent developments in visual perception theory are covered in (Wandell 1995; Palmer 1999;
Livingstone 2008; Frisby and Stone 2010).

Figure 1.8 Examples of computer vision algorithms from the 1980s: (a) pyramid blending
(Burt and Adelson 1983b) © 1983 ACM, (b) shape from shading (Freeman and Adelson 1991)
© 1991 IEEE, (c) edge detection (Freeman and Adelson 1991) © 1991 IEEE, (d) physically
based models (Terzopoulos and Witkin 1988) © 1988 IEEE, (e) regularization-based surface
reconstruction (Terzopoulos 1988) © 1988 IEEE, (f) range data acquisition and merging
(Banno, Masuda et al. 2008) © 2008 Springer.

The use of stereo as a quantitative shape cue was extended by a wide variety of shape- 
from-X techniques, including shape from shading (Figure 1.8b) (see Section 13.1.1 and Horn 
1975; Pentland 1984; Blake, Zisserman, and Knowles 1985; Horn and Brooks 1986, 1989), 
photometric stereo (see Section 13.1.1 and Woodham 1981), shape from texture (see Sec- 
tion 13.1.2 and Witkin 1981; Pentland 1984; Malik and Rosenholtz 1997), and shape from 
focus (see Section 13.1.3 and Nayar, Watanabe, and Noguchi 1995). Horn (1986) has a nice 
discussion of most of these techniques. 

Research into better edge and contour detection (Figure 1.8c) (see Section 7.2) was also 
active during this period (Canny 1986; Nalwa and Binford 1986), including the introduction
of dynamically evolving contour trackers (Section 7.3.1) such as snakes (Kass, Witkin,
and Terzopoulos 1988), as well as three-dimensional physically based models (Figure 1.8d)
(Terzopoulos, Witkin, and Kass 1987; Kass, Witkin, and Terzopoulos 1988; Terzopoulos and 
Fleischer 1988). 

Researchers noticed that a lot of the stereo, flow, shape-from-X, and edge detection al- 
gorithms could be unified, or at least described, using the same mathematical framework if 
they were posed as variational optimization problems and made more robust (well-posed) 
using regularization (Figure 1.8e) (see Section 4.2 and Terzopoulos 1983; Poggio, Torre, 
and Koch 1985; Terzopoulos 1986b; Blake and Zisserman 1987; Bertero, Poggio, and Torre 
1988; Terzopoulos 1988). Around the same time, Geman and Geman (1984) pointed out that 
such problems could equally well be formulated using discrete Markov random field (MRF) 
models (see Section 4.3), which enabled the use of better (global) search and optimization 
algorithms, such as simulated annealing. 

Online variants of MRF algorithms that modeled and updated uncertainties using the 
Kalman filter were introduced a little later (Dickmanns and Graefe 1988; Matthies, Kanade, 
and Szeliski 1989; Szeliski 1989). Attempts were also made to map both regularized and 
MRF algorithms onto parallel hardware (Poggio and Koch 1985; Poggio, Little et al. 1988;
Fischler, Firschein et al. 1989). The book by Fischler and Firschein (1987) contains a nice 
collection of articles focusing on all of these topics (stereo, flow, regularization, MRFs, and 
even higher-level vision). 

Three-dimensional range data processing (acquisition, merging, modeling, and recogni- 
tion; see Figure 1.8f) continued being actively explored during this decade (Agin and Binford 
1976; Besl and Jain 1985; Faugeras and Hebert 1987; Curless and Levoy 1996). The compi- 
lation by Kanade (1987) contains a lot of the interesting papers in this area. 


1990s. While a lot of the previously mentioned topics continued to be explored, a few of 
them became significantly more active. 

A burst of activity in using projective invariants for recognition (Mundy and Zisserman 
1992) evolved into a concerted effort to solve the structure from motion problem (see Chap- 
ter 11). A lot of the initial activity was directed at projective reconstructions, which did 
not require knowledge of camera calibration (Faugeras 1992; Hartley, Gupta, and Chang 
1992; Hartley 1994a; Faugeras and Luong 2001; Hartley and Zisserman 2004). Simultaneously,
factorization techniques (Section 11.4.1) were developed to efficiently solve problems
for which orthographic camera approximations were applicable (Figure 1.9a) (Tomasi and 
Kanade 1992; Poelman and Kanade 1997; Anandan and Irani 2002) and then later extended 
to the perspective case (Christy and Horaud 1996; Triggs 1996). Eventually, the field started 
using full global optimization (see Section 11.4.2 and Taylor, Kriegman, and Anandan 1991;
Szeliski and Kang 1994; Azarbayejani and Pentland 1995), which was later recognized as
being the same as the bundle adjustment techniques traditionally used in photogrammetry
(Triggs, McLauchlan et al. 1999). Fully automated 3D modeling systems were built using
such techniques (Beardsley, Torr, and Zisserman 1996; Schaffalitzky and Zisserman 2002;
Snavely, Seitz, and Szeliski 2006; Agarwal, Furukawa et al. 2011; Frahm, Fite-Georgel et al.
2010).

Figure 1.9 Examples of computer vision algorithms from the 1990s: (a) factorization-based
structure from motion (Tomasi and Kanade 1992) © 1992 Springer, (b) dense stereo
matching (Boykov, Veksler, and Zabih 2001), (c) multi-view reconstruction (Seitz and Dyer
1999) © 1999 Springer, (d) face tracking (Matthews, Xiao, and Baker 2007), (e) image
segmentation (Belongie, Fowlkes et al. 2002) © 2002 Springer, (f) face recognition (Turk and
Pentland 1991).

Work begun in the 1980s on using detailed measurements of color and intensity combined 
with accurate physical models of radiance transport and color image formation created its own 
subfield known as physics-based vision. A good survey of the field can be found in the three- 
volume collection on this topic (Wolff, Shafer, and Healey 1992a; Healey and Shafer 1992; 
Shafer, Healey, and Wolff 1992). 

Optical flow methods (see Chapter 9) continued to be improved (Nagel and Enkelmann 
1986; Bolles, Baker, and Marimont 1987; Horn and Weldon Jr. 1988; Anandan 1989; Bergen,
Anandan et al. 1992; Black and Anandan 1996; Bruhn, Weickert, and Schnörr 2005; Papenberg,
Bruhn et al. 2006), with (Nagel 1986; Barron, Fleet, and Beauchemin 1994; Baker,
Scharstein et al. 2011) being good surveys. Similarly, a lot of progress was made on dense 
stereo correspondence algorithms (see Chapter 12, Okutomi and Kanade (1993, 1994); Boykov, 
Veksler, and Zabih (1998); Birchfield and Tomasi (1999); Boykov, Veksler, and Zabih (2001), 
and the survey and comparison in Scharstein and Szeliski (2002)), with the biggest break- 
through being perhaps global optimization using graph cut techniques (Figure 1.9b) (Boykov, 
Veksler, and Zabih 2001). 

Multi-view stereo algorithms (Figure 1.9c) that produce complete 3D surfaces (see Sec- 
tion 12.7) were also an active topic of research (Seitz and Dyer 1999; Kutulakos and Seitz 
2000) that continues to be active today (Seitz, Curless et al. 2006; Schöps, Schönberger et
al. 2017; Knapitsch, Park et al. 2017). Techniques for producing 3D volumetric descriptions 
from binary silhouettes (see Section 12.7.3) continued to be developed (Potmesil 1987; Sri- 
vasan, Liang, and Hackwood 1990; Szeliski 1993; Laurentini 1994), along with techniques 
based on tracking and reconstructing smooth occluding contours (see Section 12.2.1 and 
Cipolla and Blake 1992; Vaillant and Faugeras 1992; Zheng 1994; Boyer and Berger 1997; 
Szeliski and Weiss 1998; Cipolla and Giblin 2000). 


Tracking algorithms also improved a lot, including contour tracking using active contours 
(see Section 7.3), such as snakes (Kass, Witkin, and Terzopoulos 1988), particle filters (Blake 
and Isard 1998), and level sets (Malladi, Sethian, and Vemuri 1995), as well as intensity-based 
(direct) techniques (Lucas and Kanade 1981; Shi and Tomasi 1994; Rehg and Kanade 1994), 
often applied to tracking faces (Figure 1.9d) (Lanitis, Taylor, and Cootes 1997; Matthews and 
Baker 2004; Matthews, Xiao, and Baker 2007) and whole bodies (Sidenbladh, Black, and 
Fleet 2000; Hilton, Fua, and Ronfard 2006; Moeslund, Hilton, and Krüger 2006).

Image segmentation (see Section 7.5) (Figure 1.9e), a topic which has been active since 
the earliest days of computer vision (Brice and Fennema 1970; Horowitz and Pavlidis 1976; 
Riseman and Arbib 1977; Rosenfeld and Davis 1979; Haralick and Shapiro 1985; Pavlidis 
and Liow 1990), was also an active topic of research, producing techniques based on min- 
imum energy (Mumford and Shah 1989) and minimum description length (Leclerc 1989), 
normalized cuts (Shi and Malik 2000), and mean shift (Comaniciu and Meer 2002). 

Statistical learning techniques started appearing, first in the application of principal com- 
ponent eigenface analysis to face recognition (Figure 1.9f) (see Section 5.2.3 and Turk and 
Pentland 1991) and linear dynamical systems for curve tracking (see Section 7.3.1 and Blake 
and Isard 1998). 

Perhaps the most notable development in computer vision during this decade was the
increased interaction with computer graphics (Seitz and Szeliski 1999), especially in the
cross-disciplinary area of image-based modeling and rendering (see Chapter 14). The idea of
manipulating real-world imagery directly to create new animations first came to prominence
with image morphing techniques (Figure 1.5c) (see Section 3.6.3 and Beier and Neely 1992)
and was later applied to view interpolation (Chen and Williams 1993; Seitz and Dyer 1996),
panoramic image stitching (Figure 1.5a) (see Section 8.2 and Mann and Picard 1994; Chen
1995; Szeliski 1996; Szeliski and Shum 1997; Szeliski 2006a), and full light-field rendering
(Figure 1.10a) (see Section 14.3 and Gortler, Grzeszczuk et al. 1996; Levoy and Hanrahan
1996; Shade, Gortler et al. 1998). At the same time, image-based modeling techniques
(Figure 1.10b) for automatically creating realistic 3D models from collections of images were also
being introduced (Beardsley, Torr, and Zisserman 1996; Debevec, Taylor, and Malik 1996;
Taylor, Debevec, and Malik 1996).

Figure 1.10 Examples of computer vision algorithms from the 2000s: (a) image-based
rendering (Gortler, Grzeszczuk et al. 1996), (b) image-based modeling (Debevec, Taylor, and
Malik 1996) © 1996 ACM, (c) interactive tone mapping (Lischinski, Farbman et al. 2006), (d)
texture synthesis (Efros and Freeman 2001), (e) feature-based recognition (Fergus, Perona,
and Zisserman 2007), (f) region-based recognition (Mori, Ren et al. 2004) © 2004 IEEE.


2000s. This decade continued to deepen the interplay between the vision and graphics 
fields, but more importantly embraced data-driven and learning approaches as core components
of vision. Many of the topics introduced under the rubric of image-based rendering,
such as image stitching (see Section 8.2), light-field capture and rendering (see Section 14.3), 
and high dynamic range (HDR) image capture through exposure bracketing (Figure 1.5b) (see 
Section 10.2 and Mann and Picard 1995; Debevec and Malik 1997), were re-christened as 
computational photography (see Chapter 10) to acknowledge the increased use of such tech- 
niques in everyday digital photography. For example, the rapid adoption of exposure brack- 
eting to create high dynamic range images necessitated the development of tone mapping 
algorithms (Figure 1.10c) (see Section 10.2.1) to convert such images back to displayable 
results (Fattal, Lischinski, and Werman 2002; Durand and Dorsey 2002; Reinhard, Stark et 
al. 2002; Lischinski, Farbman et al. 2006). In addition to merging multiple exposures, tech- 
niques were developed to merge flash images with non-flash counterparts (Eisemann and 
Durand 2004; Petschnigg, Agrawala et al. 2004) and to interactively or automatically select 
different regions from overlapping images (Agarwala, Dontcheva et al. 2004).


Texture synthesis (Figure 1.10d) (see Section 10.5), quilting (Efros and Leung 1999; Efros 
and Freeman 2001; Kwatra, Schödl et al. 2003), and inpainting (Bertalmio, Sapiro et al.
2000; Bertalmio, Vese et al. 2003; Criminisi, Pérez, and Toyama 2004) are additional topics 
that can be classified as computational photography techniques, since they re-combine input 
image samples to produce new photographs. 

A second notable trend during this decade was the emergence of feature-based techniques 
(combined with learning) for object recognition (see Section 6.1 and Ponce, Hebert et al. 
2006). Some of the notable papers in this area include the constellation model of Fergus, 
Perona, and Zisserman (2007) (Figure 1.10e) and the pictorial structures of Felzenszwalb 
and Huttenlocher (2005). Feature-based techniques also dominate other recognition tasks, 
such as scene recognition (Zhang, Marszalek et al. 2007) and panorama and location recognition
(Brown and Lowe 2007; Schindler, Brown, and Szeliski 2007). And while interest
point (patch-based) features tend to dominate current research, some groups are pursuing 
recognition based on contours (Belongie, Malik, and Puzicha 2002) and region segmentation 
(Figure 1.10f) (Mori, Ren et al. 2004). 


Another significant trend from this decade was the development of more efficient al- 
gorithms for complex global optimization problems (see Chapter 4 and Appendix B.5 and 
Szeliski, Zabih et al. 2008; Blake, Kohli, and Rother 2011). While this trend began with 
work on graph cuts (Boykov, Veksler, and Zabih 2001; Kohli and Torr 2007), a lot of progress 
has also been made in message passing algorithms, such as loopy belief propagation (LBP) 
(Yedidia, Freeman, and Weiss 2001; Kumar and Torr 2006). 


The most notable trend from this decade, which has by now completely taken over visual
recognition and most other aspects of computer vision, was the application of sophisticated
machine learning techniques to computer vision problems (see Chapters 5 and 6). This trend
coincided with the increased availability of immense quantities of partially labeled data on
the internet, as well as significant increases in computational power, which makes it more
feasible to learn object categories without the use of careful human supervision.

Figure 1.11 Examples of computer vision algorithms from the 2010s: (a) the SuperVision
deep neural network © Krizhevsky, Sutskever, and Hinton (2012); (b) object instance
segmentation (He, Gkioxari et al. 2017) © 2017 IEEE; (c) whole body, expression, and gesture
fitting from a single image (Pavlakos, Choutas et al. 2019) © 2019 IEEE; (d) fusing multiple
color depth images using the KinectFusion real-time system (Newcombe, Izadi et al.
2011) © 2011 IEEE; (e) smartphone augmented reality with real-time depth occlusion effects
(Valentin, Kowdle et al. 2018) © 2018 ACM; (f) 3D map computed in real-time on a fully
autonomous Skydio R1 drone (Cross 2019).


2010s. The trend towards using large labeled (and also self-supervised) datasets to develop 
machine learning algorithms became a tidal wave that totally revolutionized the development 
of image recognition algorithms as well as other applications, such as denoising and optical 
flow, which previously used Bayesian and global optimization techniques. 

This trend was enabled by the development of high-quality large-scale annotated datasets 
such as ImageNet (Deng, Dong et al. 2009; Russakovsky, Deng et al. 2015), Microsoft COCO 
(Common Objects in Context) (Lin, Maire et al. 2014), and LVIS (Gupta, Dollár, and Girshick
2019). These datasets provided not only reliable metrics for tracking the progress of
recognition and semantic segmentation algorithms, but more importantly, sufficient labeled
data to develop complete solutions based on machine learning.


Another major trend was the dramatic increase in computational power available from 
the development of general purpose (data-parallel) algorithms on graphical processing units 
(GPGPU). The breakthrough SuperVision (“AlexNet”) deep neural network (Figure 1.11a; 
Krizhevsky, Sutskever, and Hinton 2012), which was the first neural network to win the 
yearly ImageNet large-scale visual recognition challenge, relied on GPU training, as well 
as a number of technical advances, for its dramatic performance. After the publication of 
this paper, progress in using deep convolutional architectures accelerated dramatically, to the 
point where they are now the only architecture considered for recognition and semantic seg- 
mentation tasks (Figure 1.11b), as well as the preferred architecture for many other vision 
tasks (Chapter 5; LeCun, Bengio, and Hinton 2015), including optical flow (Sun, Yang et al.
2018), denoising, and monocular depth inference (Li, Dekel et al. 2019).


Large datasets and GPU architectures, coupled with the rapid dissemination of ideas 
through timely publications on arXiv as well as the development of languages for deep learn- 
ing and the open sourcing of neural network models, all contributed to an explosive growth 
in this area, both in rapid advances and capabilities, and also in the sheer number of publica- 
tions and researchers now working on these topics. They also enabled the extension of image 
recognition approaches to video understanding tasks such as action recognition (Feichten- 
hofer, Fan et al. 2019), as well as structured regression tasks such as real-time multi-person 
body pose estimation (Cao, Simon et al. 2017).


Specialized sensors and hardware for computer vision tasks also continued to advance. 
The Microsoft Kinect depth camera, released in 2010, quickly became an essential component 
of many 3D modeling (Figure 1.11d) and person tracking (Shotton, Fitzgibbon et al. 2011) 
systems. Over the decade, 3D body shape modeling and tracking systems continued to evolve, 
to the point where it is now possible to infer a person’s 3D model with gestures and expression 
from a single image (Figure 1.11c). 

And while depth sensors have not yet become ubiquitous (except for security applications 
on high-end phones), computational photography algorithms run on all of today’s smart- 
phones. Innovations introduced in the computer vision community, such as panoramic image 
stitching and bracketed high dynamic range image merging, are now standard features, and 
multi-image low-light denoising algorithms are also becoming commonplace (Liba, Murthy 
et al. 2019). Lightfield imaging algorithms, which allow the creation of soft depth-of-field 
effects, are now also becoming more available (Garg, Wadhwa et al. 2019). Finally, mobile
augmented reality applications that perform real-time pose estimation and environment
augmentation using combinations of feature tracking and inertial measurements are commonplace,
and are currently being extended to include pixel-accurate depth occlusion effects
(Figure 1.11e).

On higher-end platforms such as autonomous vehicles and drones, powerful real-time 
SLAM (simultaneous localization and mapping) and VIO (visual inertial odometry) algorithms
(Engel, Schöps, and Cremers 2014; Forster, Zhang et al. 2017; Engel, Koltun, and
Cremers 2018) can build accurate 3D maps that enable, e.g., autonomous flight through chal- 
lenging scenes such as forests (Figure 1.11f). 

In summary, this past decade has seen incredible advances in the performance and reliability
of computer vision algorithms, brought about in part by the shift to machine learning and
training on very large sets of real-world data. It has also seen the application of vision algo- 
rithms in myriad commercial and consumer scenarios as well as new challenges engendered 
by their widespread use (Su and Crandall 2021). 


1.3 Book overview 


In the final part of this introduction, I give a brief tour of the material in this book, as well 
as a few notes on notation and some additional general references. Since computer vision is 
such a broad field, it is possible to study certain aspects of it, e.g., geometric image formation 
and 3D structure recovery, without requiring other parts, e.g., the modeling of reflectance and 
shading. Some of the chapters in this book are only loosely coupled with others, and it is not 
strictly necessary to read all of the material in sequence. 

Figure 1.12 shows a rough layout of the contents of this book. Since computer vision 
involves going from images to both a semantic understanding as well as a 3D structural de- 
scription of the scene, I have positioned the chapters horizontally in terms of where in this 
spectrum they land, in addition to vertically according to their dependence.⁹

Interspersed throughout the book are sample applications, which relate the algorithms 
and mathematical material being presented in various chapters to useful, real-world applica- 
tions. Many of these applications are also presented in the exercises sections, so that students 
can write their own. 

At the end of each section, I provide a set of exercises that the students can use to implement,
test, and refine the algorithms and techniques presented in each section. Some of the
exercises are suitable as written homework assignments, others as shorter one-week projects,
and still others as open-ended research problems that make for challenging final projects.
Motivated students who implement a reasonable subset of these exercises will, by the end of
the book, have a computer vision software library that can be used for a variety of interesting
tasks and projects.

⁹ For an interesting comparison with what is known about the human visual system, e.g., the largely parallel
what and where pathways (Goodale and Milner 1992), see some textbooks on human perception (Palmer 1999;
Livingstone 2008; Frisby and Stone 2010).

Figure 1.12 A taxonomy of the topics covered in this book, showing the (rough) dependencies
between different chapters, which are roughly positioned along the left-right axis
depending on whether they are more closely related to images (left) or 3D geometry (right)
representations. The “what-where” along the top axis is a reference to separate visual pathways
in the visual system (Goodale and Milner 1992), but should not be taken too seriously.
Foundational techniques such as optimization and deep learning are widely used in subsequent
chapters.

If the students or curriculum do not have a strong preference for programming languages,
Python, with the NumPy scientific and array arithmetic library plus the OpenCV vision library,
is a good environment to develop algorithms and learn about vision. Not only will the
students learn how to program using array/tensor notation and linear/matrix algebra (which is
a good foundation for later use of PyTorch for deep learning), but you can also prepare classroom
assignments using Jupyter notebooks, giving you the option to combine descriptive tutorials,
sample code, and code to be extended/modified in one convenient location.¹⁰
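
To give a flavor of the array notation mentioned above, here is a minimal sketch of one blur-and-subsample step, written with NumPy slicing instead of explicit pixel loops. The function name `blur_and_downsample` and the choice of a [1, 2, 1]/4 kernel are illustrative assumptions, not code taken from this book or from OpenCV.

```python
import numpy as np

def blur_and_downsample(image):
    """Blur a 2D grayscale image with a separable [1, 2, 1]/4 kernel, then subsample by 2.

    Illustrative helper (hypothetical name); assumes a 2D array as input.
    """
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    # Replicate the border pixels so the blurred image keeps the input size.
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    # Horizontal pass: weighted sum of three shifted views of the array.
    horiz = (kernel[0] * padded[:, :-2]
             + kernel[1] * padded[:, 1:-1]
             + kernel[2] * padded[:, 2:])
    # Vertical pass applied to the horizontally blurred result.
    blurred = (kernel[0] * horiz[:-2, :]
               + kernel[1] * horiz[1:-1, :]
               + kernel[2] * horiz[2:, :])
    # Keep every other row and column.
    return blurred[::2, ::2]

if __name__ == "__main__":
    ramp = np.tile(np.arange(6.0), (8, 1))  # 8x6 horizontal intensity ramp
    print(blur_and_downsample(ramp).shape)
```

Each arithmetic expression operates on whole arrays at once, which is the same style of programming later used with tensor libraries such as PyTorch.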

As this is a reference book, I try wherever possible to discuss which techniques and al- 
gorithms work well in practice, as well as provide up-to-date pointers to the latest research 
results in the areas that I cover. The exercises can be used to build up your own personal 
library of self-tested and validated vision algorithms, which is more worthwhile in the long 
term (assuming you have the time) than simply pulling algorithms out of a library whose 
performance you do not really understand. 

The book begins in Chapter 2 with a review of the image formation processes that create 
the images that we see and capture. Understanding this process is fundamental if you want 
to take a scientific (model-based) approach to computer vision. Students who are eager to 
just start implementing algorithms (or courses that have limited time) can skip ahead to the 
next chapter and dip into this material later. In Chapter 2, we break down image formation 
into three major components. Geometric image formation (Section 2.1) deals with points, 
lines, and planes, and how these are mapped onto images using projective geometry and other 
models (including radial lens distortion). Photometric image formation (Section 2.2) covers 
radiometry, which describes how light interacts with surfaces in the world, and optics, which 
projects light onto the sensor plane. Finally, Section 2.3 covers how sensors work, including 
topics such as sampling and aliasing, color sensing, and in-camera compression. 

Chapter 3 covers image processing, which is needed in almost all computer vision appli- 
cations. This includes topics such as linear and non-linear filtering (Section 3.3), the Fourier 
transform (Section 3.4), image pyramids and wavelets (Section 3.5), and geometric transfor- 
mations such as image warping (Section 3.6). Chapter 3 also presents applications such as 
seamless image blending and image morphing. 


Chapter 4 begins with a new section on data fitting and interpolation, which provides a
conceptual framework for global optimization techniques such as regularization and Markov
random fields (MRFs), as well as machine learning, which we cover in the next chapter.
Section 4.2 covers classic regularization techniques, i.e., piecewise-continuous smoothing splines
(aka variational techniques) implemented using fast iterated linear system solvers, which are
still often the method of choice in time-critical applications such as mobile augmented reality.
The next section (4.3) presents the related topic of MRFs, which also serve as an introduction
to Bayesian inference techniques, covered at a more abstract level in Appendix B. The
chapter also discusses applications to interactive colorization and segmentation.

¹⁰ You may also be able to run your notebooks and train your models using the Google Colab service at
https://colab.research.google.com.

Figure 1.13 A pictorial summary of the chapter contents. Sources: Burt and Adelson
(1983b); Agarwala, Dontcheva et al. (2004); Glassner (2018); He, Gkioxari et al. (2017);
Brown, Szeliski, and Winder (2005); Butler, Wulff et al. (2012); Debevec and Malik (1997);
Snavely, Seitz, and Szeliski (2006); Scharstein, Hirschmüller et al. (2014); Curless and Levoy
(1996); Gortler, Grzeszczuk et al. (1996); see the figures in the respective chapters for
copyright information.


Chapter 5 is a completely new chapter covering machine learning, deep learning, and 
deep neural networks. It begins in Section 5.1 with a review of classic supervised machine 
learning approaches, which are designed to classify images (or regress values) based on 
intermediate-level features. Section 5.2 looks at unsupervised learning, which is useful for 
both understanding unlabeled training data and providing models of real-world distributions. 
Section 5.3 presents the basic elements of feedforward neural networks, including weights, 
layers, and activation functions, as well as methods for network training. Section 5.4 goes 
into more detail on convolutional networks and their applications to both recognition and image
processing. The last section in the chapter discusses more complex networks, including
3D, spatio-temporal, recurrent, and generative networks.


Chapter 6 covers the topic of recognition. In the first edition of this book this chapter 
came last, since it built upon earlier methods such as segmentation and feature matching. 
With the advent of deep networks, many of these intermediate representations are no longer 
necessary, since the network can learn them as part of the training process. As so much of 
computer vision research is now devoted to various recognition topics, I decided to move this
chapter up so that students can learn about it earlier in the course.


The chapter begins with the classic problem of instance recognition, i.e., finding instances 
of known 3D objects in cluttered scenes. Section 6.2 covers both traditional and deep network 
approaches to whole image classification, i.e., what used to be called category recognition. It
also discusses the special case of facial recognition. Section 6.3 presents algorithms for object 
detection (drawing bounding boxes around recognized objects), with a brief review of older 
approaches to face and pedestrian detection. Section 6.4 covers various flavors of semantic 
segmentation (generating per-pixel labels), including instance segmentation (delineating sep- 
arate objects), pose estimation (labeling pixels with body parts), and panoptic segmentation 
(labeling both things and stuff). In Section 6.5, we briefly look at some recent papers in video 
understanding and action recognition, while in Section 6.6 we mention some recent work in 
image captioning and visual question answering. 


In Chapter 7, we cover feature detection and matching. A lot of current 3D reconstruction
and recognition techniques are built on extracting and matching feature points (Section 7.1),
so this is a fundamental technique required by many subsequent chapters (Chapters 8 and
11) and even in instance recognition (Section 6.1). We also cover edge and straight line
detection in Sections 7.2 and 7.4, contour tracking in Section 7.3, and low-level segmentation
techniques in Section 7.5.


Feature detection and matching are used in Chapter 8 to perform image alignment (or reg- 
istration) and image stitching. We introduce the basic techniques of feature-based alignment 
and show how this problem can be solved using either linear or non-linear least squares, de- 
pending on the motion involved. We also introduce additional concepts, such as uncertainty 
weighting and robust regression, which are essential to making real-world systems work. 
Feature-based alignment is then used as a building block for both 2D applications such as 
image stitching (Section 8.2) and computational photography (Chapter 10), as well as 3D
geometric alignment tasks such as pose estimation and structure from motion (Chapter 11).
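
As a rough illustration of the linear least squares case, the sketch below fits a 2D affine transform to matched feature locations. It assumes the correspondences are already known and free of outliers, so it omits the uncertainty weighting and robust regression mentioned above; the function name `fit_affine` is a hypothetical helper, not an API from this book or from OpenCV.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src points (N, 2) onto dst points (N, 2).

    Hypothetical helper for illustration; needs at least 3 non-collinear matches.
    Returns a 2x3 matrix [A | t] such that dst is approximately src @ A.T + t.
    """
    n = src.shape[0]
    # Each row of the design matrix is [x, y, 1]; the unknowns form a 3x2 matrix.
    design = np.hstack([src, np.ones((n, 1))])
    params, _, _, _ = np.linalg.lstsq(design, dst, rcond=None)
    return params.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.uniform(0.0, 100.0, size=(20, 2))
    true_affine = np.array([[1.1, -0.2, 3.0],
                            [0.3, 0.9, -1.0]])
    dst = src @ true_affine[:, :2].T + true_affine[:, 2]
    print(np.round(fit_affine(src, dst), 3))
```

The affine model is linear in its six parameters, which is why a single call to a linear solver suffices. Motion models that are non-linear in their parameters, such as rotations or full perspective transforms, generally call for iterative non-linear least squares instead.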


The second part of Chapter 8 is devoted to image stitching, i.e., the construction of large
panoramas and composites. While stitching is just one example of computational photog- 
raphy (see Chapter 10), there is enough depth here to warrant a separate section. We start 
by discussing various possible motion models (Section 8.2.1), including planar motion and 
pure camera rotation. We then discuss global alignment (Section 8.3), which is a special 
(simplified) case of general bundle adjustment, and then present panorama recognition, i.e., 
techniques for automatically discovering which images actually form overlapping panoramas. 
Finally, we cover the topics of image compositing and blending (Section 8.4), which involve 
both selecting which pixels from which images to use and blending them together so as to 
disguise exposure differences. 


Image stitching is a wonderful application that ties together most of the material covered 
in earlier parts of this book. It also makes for a good mid-term course project that can build on 
previously developed techniques such as image warping and feature detection and matching. 
Sections 8.2-8.4 also present more specialized variants of stitching such as whiteboard and 
document scanning, video summarization, panography, full 360° spherical panoramas, and 
interactive photomontage for blending repeated action shots together. 


In Chapter 9, we generalize the concept of feature-based image alignment to cover dense 
intensity-based motion estimation, i.e., optical flow. We start with the simplest possible 
motion models, translational motion (Section 9.1), and cover topics such as hierarchical 
(coarse-to-fine) motion estimation, Fourier-based techniques, and iterative refinement. We 
then present parametric motion models, which can be used to compensate for camera rota- 
tion and zooming, as well as affine or planar perspective motion (Section 9.2). This is then 
generalized to spline-based motion models (Section 9.2.2) and finally to general per-pixel 
optical flow (Section 9.3). We close the chapter in Section 9.4 with a discussion of layered 
and learned motion models as well as video object segmentation and tracking. Applications 
of motion estimation techniques include automated morphing, video denoising, and frame 
interpolation (slow motion). 


Chapter 10 presents additional examples of computational photography, which is the pro- 
cess of creating new images from one or more input photographs, often based on the careful 
modeling and calibration of the image formation process (Section 10.1). Computational pho- 
tography techniques include merging multiple exposures to create high dynamic range images 
(Section 10.2), increasing image resolution through blur removal and super-resolution (Sec- 
tion 10.3), and image editing and compositing operations (Section 10.4). We also cover the 
topics of texture analysis, synthesis, and inpainting (hole filling) in Section 10.5, as well as 
non-photorealistic rendering and style transfer. 


Starting in Chapter 11, we delve more deeply into techniques for reconstructing 3D mod- 
els from images. We begin by introducing methods for intrinsic camera calibration in Sec- 
tion 11.1 and 3D pose estimation, i.e., extrinsic calibration, in Section 11.2. These sections 
also describe the applications of single-view reconstruction of building models and 3D loca- 
tion recognition. We then cover the topic of triangulation (Section 11.2.4), which is the 3D 
reconstruction of points from matched features when the camera positions are known. 


Chapter 11 then moves on to the topic of structure from motion, which involves the simul- 
taneous recovery of 3D camera motion and 3D scene structure from a collection of tracked 
2D features. We begin with two-frame structure from motion (Section 11.3), for which al- 
gebraic techniques exist, as well as robust sampling techniques such as RANSAC that can 
discount erroneous feature matches. We then cover techniques for multi-frame structure 
from motion, including factorization (Section 11.4.1), bundle adjustment (Section 11.4.2), 
and constrained motion and structure models (Section 11.4.8). We present applications in 
visual effects (match move) and sparse 3D model construction for large (e.g., internet) photo 
collections. The final part of this chapter (Section 11.5) has a new section on simultaneous 
localization and mapping (SLAM) as well as its applications to autonomous navigation and 
mobile augmented reality (AR). 


In Chapter 12, we turn to the topic of stereo correspondence, which can be thought of 
as a special case of motion estimation where the camera positions are already known (Sec- 
tion 12.1). This additional knowledge enables stereo algorithms to search over a much smaller 
space of correspondences to produce dense depth estimates using various combinations of 
matching criteria, optimization algorithms, and/or deep networks (Sections 12.3-12.6). We 
also cover multi-view stereo algorithms that build a true 3D surface representation instead 
of just a single depth map (Section 12.7), as well as monocular depth inference algorithms 
that hallucinate depth maps from just a single image (Section 12.8). Applications of stereo 
matching include head and gaze tracking, as well as depth-based background replacement 
(Z-keying). 

Chapter 13 covers additional 3D shape and appearance modeling techniques. These in- 
clude classic shape-from-X techniques such as shape from shading, shape from texture, and 
shape from focus (Section 13.1). An alternative to all of these passive computer vision tech- 
niques is to use active rangefinding (Section 13.2), i.e., to project patterned light onto scenes 
and recover the 3D geometry through triangulation. Processing all of these 3D representations 
often involves interpolating or simplifying the geometry (Section 13.3), or using alternative 
representations such as surface point sets (Section 13.4) or implicit functions (Section 13.5). 

The collection of techniques for going from one or more images to partial or full 3D 
models is often called image-based modeling or 3D photography. Section 13.6 examines 
three more specialized application areas (architecture, faces, and human bodies), which can 
use model-based reconstruction to fit parameterized models to the sensed data. Section 13.7 
examines the topic of appearance modeling, i.e., techniques for estimating the texture maps, 
albedos, or even sometimes complete bi-directional reflectance distribution functions (BRDFs) 
that describe the appearance of 3D surfaces. 

In Chapter 14, we discuss the large number of image-based rendering techniques that 
have been developed in the last three decades, including simpler techniques such as view in- 
terpolation (Section 14.1), layered depth images (Section 14.2), and sprites and layers (Sec- 
tion 14.2.1), as well as the more general framework of light fields and Lumigraphs (Sec- 
tion 14.3) and higher-order fields such as environment mattes (Section 14.4). Applications of 
these techniques include navigating 3D collections of photographs using photo tourism. 

Next, we discuss video-based rendering, which is the temporal extension of image-based 
rendering. The topics we cover include video-based animation (Section 14.5.1), periodic 
video turned into video textures (Section 14.5.2), and 3D video constructed from multiple 
video streams (Section 14.5.4). Applications of these techniques include animating still im- 
ages and creating home tours based on 360° video. We finish the chapter with an overview of 
the new emerging field of neural rendering. 

To support the book’s use as a textbook, the appendices and associated website contain 
more detailed mathematical topics and additional material. Appendix A covers linear algebra 
and numerical techniques, including matrix algebra, least squares, and iterative techniques. 
Appendix B covers Bayesian estimation theory, including maximum likelihood estimation, 
robust statistics, Markov random fields, and uncertainty modeling. Appendix C describes 
the supplementary material that can be used to complement this book, including images and 
datasets, pointers to software, and course slides. 

Week   Chapter        Topics
 1.    Chapters 1-2   Introduction and image formation
 2.    Chapter 3      Image processing
 3.    Chapters 4-5   Optimization and learning
 4.    Chapter 5      Deep learning
 5.    Chapter 6      Recognition
 6.    Chapter 7      Feature detection and matching
 7.    Chapter 8      Image alignment and stitching
 8.    Chapter 9      Motion estimation
 9.    Chapter 10     Computational photography
10.    Chapter 11     Structure from motion
11.    Chapter 12     Depth estimation
12.    Chapter 13     3D reconstruction
13.    Chapter 14     Image-based rendering

Table 1.1 Sample syllabus for a one semester 13-week course. A 10-week quarter could go 
into lesser depth or omit some topics. 


1.4 Sample syllabus 


Teaching all of the material covered in this book in a single quarter or semester course is a 
Herculean task and likely one not worth attempting.¹¹ It is better to simply pick and choose 
topics related to the lecturer’s preferred emphasis and tailored to the set of mini-projects 
envisioned for the students. 

Steve Seitz and I have successfully used a 10-week syllabus similar to the one shown 
in Table 1.1 as both an undergraduate and a graduate-level course in computer vision. The 
undergraduate course¹² tends to go lighter on the mathematics and takes more time reviewing 
basics, while the graduate-level course¹³ dives more deeply into techniques and assumes the 
students already have a decent grounding in either vision or related mathematical techniques. 
Related courses have also been taught on the topics of 3D photography and computational 
photography. Appendix C.3 and the book’s website list other courses that use this book to 
teach a similar curriculum. 


11 Some universities, such as Stanford (CS231A & 231N), Berkeley (CS194-26/294-26 & 280), and the University 
of Michigan (EECS 498/598 & 442), now split the material over two courses. 

12 http://www.cs.washington.edu/education/courses/455 

13 http://www.cs.washington.edu/education/courses/576 

When Steve and I teach the course, we prefer to give the students several small program- 
ming assignments early in the course rather than focusing on written homework or quizzes. 
With a suitable choice of topics, it is possible for these projects to build on each other. For ex- 
ample, introducing feature matching early on can be used in a second assignment to do image 
alignment and stitching. Alternatively, direct (optical flow) techniques can be used to do the 
alignment and more focus can be put on either graph cut seam selection or multi-resolution 
blending techniques. 

In the past, we have also asked the students to propose a final project (we provide a set of 
suggested topics for those who need ideas) by the middle of the course and reserved the last 
week of the class for student presentations. Sometimes, a few of these projects have actually 
turned into conference submissions! 

No matter how you decide to structure the course or how you choose to use this book, 
I encourage you to try at least a few small programming tasks to get a feel for how vision 
techniques work and how they fail. Better yet, pick topics that are fun and can be used on 
your own photographs, and try to push your creative boundaries to come up with surprising 
results. 


1.5 A note on notation 


For better or worse, the notation found in computer vision and multi-view geometry textbooks 
tends to vary all over the map (Faugeras 1993; Hartley and Zisserman 2004; Girod, Greiner, 
and Niemann 2000; Faugeras and Luong 2001; Forsyth and Ponce 2003). In this book, I 
use the convention I first learned in my high school physics class (and later multi-variate 
calculus and computer graphics courses), which is that vectors $\mathbf{v}$ are lower case bold, matrices 
$\mathbf{M}$ are upper case bold, and scalars $(T, s)$ are mixed case italic. Unless otherwise noted, 
vectors operate as column vectors, i.e., they post-multiply matrices, $\mathbf{M}\mathbf{v}$, although they are 
sometimes written as comma-separated parenthesized lists $\mathbf{x} = (x, y)$ instead of bracketed 
column vectors $\mathbf{x} = [x \; y]^T$. Some commonly used matrices are $\mathbf{R}$ for rotations, $\mathbf{K}$ for 
calibration matrices, and $\mathbf{I}$ for the identity matrix. Homogeneous coordinates (Section 2.1) 
are denoted with a tilde over the vector, e.g., $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w}) = \tilde{w}(x, y, 1) = \tilde{w}\bar{\mathbf{x}}$ in $\mathcal{P}^2$. The 
cross product operator in matrix form is denoted by $[\;]_\times$. 


1.6 Additional reading 


This book attempts to be self-contained, so that students can implement the basic assignments 
and algorithms described here without the need for outside references. However, it does pre- 
suppose a general familiarity with basic concepts in linear algebra and numerical techniques, 
which are reviewed in Appendix A, and image processing, which is reviewed in Chapter 3. 

Students who want to delve more deeply into these topics can look in Golub and Van 
Loan (1996) for matrix algebra and Strang (1988) for linear algebra. In image processing, 
there are a number of popular textbooks, including Crane (1997), Gomes and Velho (1997), 
Jähne (1997), Pratt (2007), Russ (2007), Burger and Burge (2008), and Gonzalez and Woods 
(2017). For computer graphics, popular texts include Hughes, van Dam et al. (2013) and 
Marschner and Shirley (2015), with Glassner (1995) providing a more in-depth look at image 
formation and rendering. For statistics and machine learning, Chris Bishop’s (2006) book 
is a wonderful and comprehensive introduction with a wealth of exercises, while Murphy 
(2012) provides a more recent take on the field and Hastie, Tibshirani, and Friedman (2009) 
a more classic treatment. A great introductory text to deep learning is Glassner (2018), while 
Goodfellow, Bengio, and Courville (2016) and Zhang, Lipton et al. (2021) provide more 
comprehensive treatments. Students may also want to look in other textbooks on computer 
vision for material that we do not cover here, as well as for additional project ideas (Nalwa 
1993; Trucco and Verri 1998; Hartley and Zisserman 2004; Forsyth and Ponce 2011; Prince 
2012; Davies 2017). 

There is, however, no substitute for reading the latest research literature, both for the 
latest ideas and techniques and for the most up-to-date references to related literature.¹⁴ In 
this book, I have attempted to cite the most recent work in each field so that students can read 
them directly and use them as inspiration for their own work. Browsing the last few years’ 
conference proceedings from the major vision, graphics, and machine learning conferences, 
such as CVPR, ECCV, ICCV, SIGGRAPH, and NeurIPS, as well as keeping an eye out for 
the latest publications on arXiv, will provide a wealth of new ideas. The tutorials offered at 
these conferences, for which slides or notes are often available online, are also an invaluable 
resource. 


14 For a comprehensive bibliography and taxonomy of computer vision research, Keith Price’s Annotated Com- 
puter Vision Bibliography https://www.visionbib.com/bibliography/contents.html is an invaluable resource. 


Figure 2.1 A few components of the image formation process: (a) perspective projection; 
(b) light scattering when hitting a surface; (c) lens optics; (d) Bayer color filter array. 


2 Image formation 


Before we can analyze and manipulate images, we need to establish a vocabulary for de- 
scribing the geometry of a scene. We also need to understand the image formation process 
that produced a particular image given a set of lighting conditions, scene geometry, surface 
properties, and camera optics. In this chapter, we present a simplified model of this image 
formation process. 


Section 2.1 introduces the basic geometric primitives used throughout the book (points, 
lines, and planes) and the geometric transformations that project these 3D quantities into 2D 
image features (Figure 2.1a). Section 2.2 describes how lighting, surface properties (Fig- 
ure 2.1b), and camera optics (Figure 2.1c) interact to produce the color values that fall onto 
the image sensor. Section 2.3 describes how continuous color images are turned into discrete 
digital samples inside the image sensor (Figure 2.1d) and how to avoid (or at least character- 
ize) sampling deficiencies, such as aliasing. 


The material covered in this chapter is but a brief summary of a very rich and deep set of 
topics, traditionally covered in a number of separate fields. A more thorough introduction to 
the geometry of points, lines, planes, and projections can be found in textbooks on multi-view 
geometry (Hartley and Zisserman 2004; Faugeras and Luong 2001) and computer graphics 
(Hughes, van Dam et al. 2013). The image formation (synthesis) process is traditionally 
taught as part of a computer graphics curriculum (Glassner 1995; Watt 1995; Hughes, van 
Dam et al. 2013; Marschner and Shirley 2015) but it is also studied in physics-based computer 
vision (Wolff, Shafer, and Healey 1992a). The behavior of camera lens systems is studied in 
optics (Möller 1988; Ray 2002; Hecht 2015). Some good books on color theory are Healey 
and Shafer (1992), Wandell (1995), and Wyszecki and Stiles (2000), with Livingstone (2008) 
providing a more fun and informal introduction to the topic of color perception. Topics 
relating to sampling and aliasing are covered in textbooks on signal and image processing 
(Crane 1997; Jähne 1997; Oppenheim and Schafer 1996; Oppenheim, Schafer, and Buck 
1999; Pratt 2007; Russ 2007; Burger and Burge 2008; Gonzalez and Woods 2017). The 
recent book by Ikeuchi, Matsushita et al. (2020) also covers 3D geometry, photometry, and 
sensor models, with an emphasis on active illumination systems. 


A note to students: If you have already studied computer graphics, you may want to 
skim the material in Section 2.1, although the sections on projective depth and object-centered 
projection near the end of Section 2.1.4 may be new to you. Similarly, physics students (as 
well as computer graphics students) will mostly be familiar with Section 2.2. Finally, students 
with a good background in image processing will already be familiar with sampling issues 
(Section 2.3) as well as some of the material in Chapter 3. 


2.1 Geometric primitives and transformations 


In this section, we introduce the basic 2D and 3D primitives used in this textbook, namely 
points, lines, and planes. We also describe how 3D features are projected into 2D features. 
More detailed descriptions of these topics (along with a gentler and more intuitive introduc- 
tion) can be found in textbooks on multiple-view geometry (Hartley and Zisserman 2004; 
Faugeras and Luong 2001). 

Geometric primitives form the basic building blocks used to describe three-dimensional 
shapes. In this section, we introduce points, lines, and planes. Later sections of the book 
discuss curves (Sections 7.3 and 12.2), surfaces (Section 13.3), and volumes (Section 13.5). 


2D points. 2D points (pixel coordinates in an image) can be denoted using a pair of values, 
$\mathbf{x} = (x, y) \in \mathcal{R}^2$, or alternatively, 

$$\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}. \tag{2.1}$$

(As stated in the introduction, we use the $(x_1, x_2, \ldots)$ notation to denote column vectors.) 

2D points can also be represented using homogeneous coordinates, $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w}) \in \mathcal{P}^2$, 
where vectors that differ only by scale are considered to be equivalent. $\mathcal{P}^2 = \mathcal{R}^3 - (0, 0, 0)$ 
is called the 2D projective space. 

A homogeneous vector $\tilde{\mathbf{x}}$ can be converted back into an inhomogeneous vector $\mathbf{x}$ by di- 
viding through by the last element $\tilde{w}$, i.e., 

$$\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{w}) = \tilde{w}(x, y, 1) = \tilde{w}\bar{\mathbf{x}}, \tag{2.2}$$

where $\bar{\mathbf{x}} = (x, y, 1)$ is the augmented vector. Homogeneous points whose last element is $\tilde{w} = 0$ 
are called ideal points or points at infinity and do not have an equivalent inhomogeneous 
representation. 
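As a minimal sketch of Equation (2.2) in plain Python, the conversion in both directions might look as follows. The function names `to_homogeneous` and `from_homogeneous` and the tolerance `eps` are illustrative assumptions, not notation from this book.

```python
def to_homogeneous(x, y, w=1.0):
    """Return the homogeneous vector w * (x, y, 1); w = 1 gives the augmented vector."""
    return (w * x, w * y, w)


def from_homogeneous(p, eps=1e-12):
    """Divide through by the last element to recover (x, y).

    Ideal points (last element 0) have no inhomogeneous equivalent,
    so they are rejected here (eps is an assumed tolerance).
    """
    xt, yt, wt = p
    if abs(wt) < eps:
        raise ValueError("point at infinity has no inhomogeneous representation")
    return (xt / wt, yt / wt)
```

Any non-zero scale `w` passed to `to_homogeneous` yields the same 2D point after `from_homogeneous`, which is the scale equivalence described above.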


2D lines. 2D lines can also be represented using homogeneous coordinates $\tilde{\mathbf{l}} = (a, b, c)$. 
The corresponding line equation is 

$$\bar{\mathbf{x}} \cdot \tilde{\mathbf{l}} = ax + by + c = 0. \tag{2.3}$$

We can normalize the line equation vector so that $\mathbf{l} = (\hat{n}_x, \hat{n}_y, d) = (\hat{\mathbf{n}}, d)$ with $\|\hat{\mathbf{n}}\| = 1$. In 
this case, $\hat{\mathbf{n}}$ is the normal vector perpendicular to the line and $d$ is its distance to the origin 
(Figure 2.2). (The one exception to this normalization is the line at infinity $\tilde{\mathbf{l}} = (0, 0, 1)$, 
which includes all (ideal) points at infinity.) 

Figure 2.2 (a) 2D line equation and (b) 3D plane equation, expressed in terms of the 
normal $\hat{\mathbf{n}}$ and distance to the origin $d$. 


We can also express $\hat{\mathbf{n}}$ as a function of rotation angle $\theta$, $\hat{\mathbf{n}} = (\hat{n}_x, \hat{n}_y) = (\cos\theta, \sin\theta)$ 
(Figure 2.2a). This representation is commonly used in the Hough transform line-finding 
algorithm, which is discussed in Section 7.4.2. The combination $(\theta, d)$ is also known as 
polar coordinates. 
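The normalized and polar forms of a line can be sketched as below. This assumes the sign convention of Equation (2.3), so a point on the line satisfies $\hat{\mathbf{n}} \cdot \mathbf{x} + d = 0$ and $d$ is a signed quantity whose magnitude is the distance to the origin. The helper names are hypothetical.

```python
import math


def normalize_line(line):
    """Scale (a, b, c) so that the normal (a, b) has unit length."""
    a, b, c = line
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("the line at infinity (0, 0, c) cannot be normalized")
    return (a / norm, b / norm, c / norm)


def line_to_polar(line):
    """Return (theta, d) such that the unit normal is (cos(theta), sin(theta))."""
    nx, ny, d = normalize_line(line)
    return (math.atan2(ny, nx), d)


def polar_to_line(theta, d):
    """Inverse of line_to_polar: rebuild the normalized line (n_x, n_y, d)."""
    return (math.cos(theta), math.sin(theta), d)
```

For example, the line $3x + 4y + 10 = 0$ normalizes to approximately $(0.6, 0.8, 2)$, i.e., it lies two units from the origin.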


When using homogeneous coordinates, we can compute the intersection of two lines as 

$$\tilde{\mathbf{x}} = \tilde{\mathbf{l}}_1 \times \tilde{\mathbf{l}}_2, \tag{2.4}$$

where $\times$ is the cross product operator. Similarly, the line joining two points can be written as 

$$\tilde{\mathbf{l}} = \tilde{\mathbf{x}}_1 \times \tilde{\mathbf{x}}_2. \tag{2.5}$$

When trying to fit an intersection point to multiple lines or, conversely, a line to multiple 
points, least squares techniques (Section 8.1.1 and Appendix A.2) can be used, as discussed 
in Exercise 2.1. 
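Equations (2.4) and (2.5) translate almost directly into code. The following sketch uses plain tuples for homogeneous 3-vectors; the function names and the example lines are illustrative choices.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p1, p2):
    """Homogeneous line joining two homogeneous points (Equation 2.5)."""
    return cross(p1, p2)


def intersection(l1, l2):
    """Homogeneous intersection point of two homogeneous lines (Equation 2.4)."""
    return cross(l1, l2)


# The line y = x through (0, 0) and (1, 1), and the line x = 2 through (2, 0) and (2, 3).
diag = line_through((0.0, 0.0, 1.0), (1.0, 1.0, 1.0))
vert = line_through((2.0, 0.0, 1.0), (2.0, 3.0, 1.0))
x = intersection(diag, vert)
point = (x[0] / x[2], x[1] / x[2])  # divide by the last element to get (x, y)

# Two parallel lines (x = 2 and x = 5) meet at an ideal point, whose last element is 0.
ideal = intersection((1.0, 0.0, -2.0), (1.0, 0.0, -5.0))
```

Note that no special case is needed for parallel lines: the cross product simply returns a point at infinity, which is one of the practical attractions of the homogeneous representation.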


2D conics. There are other algebraic curves that can be expressed with simple polynomial 
homogeneous equations. For example, the conic sections (so called because they arise as the 
intersection of a plane and a 3D cone) can be written using a quadric equation 

$$\tilde{\mathbf{x}}^T \mathbf{Q} \tilde{\mathbf{x}} = 0. \tag{2.6}$$

Quadric equations play useful roles in the study of multi-view geometry and camera calibra- 
tion (Hartley and Zisserman 2004; Faugeras and Luong 2001) but are not used extensively in 
this book. 


3D points. Point coordinates in three dimensions can be written using inhomogeneous co- 
ordinates $\mathbf{x} = (x, y, z) \in \mathcal{R}^3$ or homogeneous coordinates $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{z}, \tilde{w}) \in \mathcal{P}^3$. As before, 
it is sometimes useful to denote a 3D point using the augmented vector $\bar{\mathbf{x}} = (x, y, z, 1)$ with 
$\tilde{\mathbf{x}} = \tilde{w}\bar{\mathbf{x}}$. 

Figure 2.3 3D line equation, $\mathbf{r} = (1 - \lambda)\mathbf{p} + \lambda\mathbf{q}$. 


3D planes. 3D planes can also be represented as homogeneous coordinates $\tilde{\mathbf{m}} = (a, b, c, d)$ 
with a corresponding plane equation 

$$\bar{\mathbf{x}} \cdot \tilde{\mathbf{m}} = ax + by + cz + d = 0. \tag{2.7}$$

We can also normalize the plane equation as $\mathbf{m} = (\hat{n}_x, \hat{n}_y, \hat{n}_z, d) = (\hat{\mathbf{n}}, d)$ with $\|\hat{\mathbf{n}}\| = 1$. 
In this case, $\hat{\mathbf{n}}$ is the normal vector perpendicular to the plane and $d$ is its distance to the 
origin (Figure 2.2b). As with the case of 2D lines, the plane at infinity $\tilde{\mathbf{m}} = (0, 0, 0, 1)$, 
which contains all the points at infinity, cannot be normalized (i.e., it does not have a unique 
normal or a finite distance). 


We can express $\hat{\mathbf{n}}$ as a function of two angles $(\theta, \phi)$, 

$$\hat{\mathbf{n}} = (\cos\theta\cos\phi, \sin\theta\cos\phi, \sin\phi), \tag{2.8}$$

i.e., using spherical coordinates, but these are less commonly used than polar coordinates 
since they do not uniformly sample the space of possible normal vectors. 
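As a small illustration of Equations (2.7) and (2.8), the sketch below builds a unit normal from the two angles and evaluates the normalized plane equation at a point. The function names are assumptions made for this example, and the sign convention follows Equation (2.7), so the returned value is zero for points on the plane.

```python
import math


def normal_from_angles(theta, phi):
    """Unit normal from the two spherical angles of Equation (2.8)."""
    return (math.cos(theta) * math.cos(phi),
            math.sin(theta) * math.cos(phi),
            math.sin(phi))


def signed_distance(plane, point):
    """Evaluate n . x + d for a normalized plane (n_x, n_y, n_z, d) and a 3D point."""
    nx, ny, nz, d = plane
    x, y, z = point
    return nx * x + ny * y + nz * z + d
```

With the plane $z = 2$ written as $(0, 0, 1, -2)$, the point $(0, 0, 5)$ evaluates to $3$, its signed distance from the plane along the normal.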


3D lines. Lines in 3D are less elegant than either lines in 2D or planes in 3D. One possible 
representation is to use two points on the line, $(\mathbf{p}, \mathbf{q})$. Any other point on the line can be 
expressed as a linear combination of these two points 

$$\mathbf{r} = (1 - \lambda)\mathbf{p} + \lambda\mathbf{q}, \tag{2.9}$$

as shown in Figure 2.3. If we restrict $0 \le \lambda \le 1$, we get the line segment joining $\mathbf{p}$ and $\mathbf{q}$. 

If we use homogeneous coordinates, we can write the line as 

$$\tilde{\mathbf{r}} = \mu\tilde{\mathbf{p}} + \lambda\tilde{\mathbf{q}}. \tag{2.10}$$

A special case of this is when the second point is at infinity, i.e., $\tilde{\mathbf{q}} = (\hat{d}_x, \hat{d}_y, \hat{d}_z, 0) = (\hat{\mathbf{d}}, 0)$. 
Here, we see that $\hat{\mathbf{d}}$ is the direction of the line. We can then re-write the inhomogeneous 3D 
line equation as 

$$\mathbf{r} = \mathbf{p} + \lambda\hat{\mathbf{d}}. \tag{2.11}$$
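The two inhomogeneous parameterizations, Equations (2.9) and (2.11), can be sketched as follows (the function names are illustrative). In this sketch the direction is simply taken to be $\mathbf{q} - \mathbf{p}$ without normalizing it, so that both forms share the same parameter $\lambda$.

```python
def point_on_line(p, q, lam):
    """r = (1 - lam) * p + lam * q  (Equation 2.9)."""
    return tuple((1.0 - lam) * pi + lam * qi for pi, qi in zip(p, q))


def point_along_direction(p, d, lam):
    """r = p + lam * d  (Equation 2.11)."""
    return tuple(pi + lam * di for pi, di in zip(p, d))


p = (0.0, 0.0, 0.0)
q = (2.0, 4.0, 6.0)
d = tuple(qi - pi for pi, qi in zip(p, q))  # unnormalized direction q - p

midpoint = point_on_line(p, q, 0.5)          # lam in [0, 1] stays on the segment
same_point = point_along_direction(p, d, 0.5)
```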


A disadvantage of the endpoint representation for 3D lines is that it has too many degrees 
of freedom, i.e., six (three for each endpoint) instead of the four degrees that a 3D line truly 
has. However, if we fix the two points on the line to lie in specific planes, we obtain a rep- 
resentation with four degrees of freedom. For example, if we are representing nearly vertical 
lines, then z = 0 and z = 1 form two suitable planes, i.e., the (x, y) coordinates in both 
planes provide the four coordinates describing the line. This kind of two-plane parameteri- 
zation is used in the light field and Lumigraph image-based rendering systems described in 
Chapter 14 to represent the collection of rays seen by a camera as it moves in front of an 
object. The two-endpoint representation is also useful for representing line segments, even 
when their exact endpoints cannot be seen (only guessed at). 

If we wish to represent all possible lines without bias towards any particular orientation, 
we can use Plücker coordinates (Hartley and Zisserman 2004, Section 3.2; Faugeras and 
Luong 2001, Chapter 3). These coordinates are the six independent non-zero entries in the 4 
× 4 skew symmetric matrix 

$$\mathbf{L} = \tilde{\mathbf{p}}\tilde{\mathbf{q}}^T - \tilde{\mathbf{q}}\tilde{\mathbf{p}}^T, \tag{2.12}$$

where $\tilde{\mathbf{p}}$ and $\tilde{\mathbf{q}}$ are any two (non-identical) points on the line. This representation has only 
four degrees of freedom, since $\mathbf{L}$ is homogeneous and also satisfies $|\mathbf{L}| = 0$, which results in 
a quadratic constraint on the Plücker coordinates. 
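To make Equation (2.12) concrete, the sketch below builds $\mathbf{L}$ from two homogeneous points and checks the quadratic constraint. It relies on the standard fact that the determinant of a 4 × 4 skew-symmetric matrix is the square of its Pfaffian, so $|\mathbf{L}| = 0$ is equivalent to the Pfaffian vanishing. The function names and the example points are assumptions for illustration.

```python
def plucker_matrix(p, q):
    """4 x 4 skew-symmetric matrix L = p q^T - q p^T for homogeneous points p and q."""
    return [[p[i] * q[j] - q[i] * p[j] for j in range(4)] for i in range(4)]


def plucker_constraint(L):
    """Pfaffian of L; det(L) is its square, so it vanishes for a valid line."""
    return L[0][1] * L[2][3] - L[0][2] * L[1][3] + L[0][3] * L[1][2]


p = (1.0, 0.0, 2.0, 1.0)
q = (0.0, 1.0, 1.0, 1.0)
L = plucker_matrix(p, q)

# The six entries above the diagonal are the Pluecker coordinates of the line.
coords = [L[i][j] for i in range(4) for j in range(i + 1, 4)]
```

Six coordinates, minus one for the overall scale and one for the quadratic constraint, leave the four degrees of freedom mentioned above.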

In practice, the minimal representation is not essential for most applications. An ade- 
quate model of 3D lines can be obtained by estimating their direction (which may be known 
ahead of time, e.g., for architecture) and some point within the visible portion of the line 
(see Section 11.4.8) or by using the two endpoints, since lines are most often visible as fi- 
nite line segments. However, if you are interested in more details about the topic of minimal 
line parameterizations, Förstner (2005) discusses various ways to infer and model 3D lines in 
projective geometry, as well as how to estimate the uncertainty in such fitted models. 

3D quadrics. The 3D analog of a conic section is a quadric surface 

$$\bar{\mathbf{x}}^T \mathbf{Q} \bar{\mathbf{x}} = 0 \tag{2.13}$$

(Hartley and Zisserman 2004, Chapter 3). Again, while quadric surfaces are useful in the 
study of multi-view geometry and can also serve as useful modeling primitives (spheres, 
ellipsoids, cylinders), we do not study them in great detail in this book. 

Figure 2.4 Basic set of 2D planar transformations. 


2.1.1 2D transformations 


Having defined our basic primitives, we can now turn our attention to how they can be trans- 
formed. The simplest transformations occur in the 2D plane and are illustrated in Figure 2.4. 


Translation. 2D translations can be written as $\mathbf{x}' = \mathbf{x} + \mathbf{t}$ or 

$$\mathbf{x}' = \begin{bmatrix} \mathbf{I} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}}, \tag{2.14}$$

where $\mathbf{I}$ is the (2 × 2) identity matrix or 

$$\bar{\mathbf{x}}' = \begin{bmatrix} \mathbf{I} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} \bar{\mathbf{x}}, \tag{2.15}$$

where $\mathbf{0}$ is the zero vector. Using a 2 × 3 matrix results in a more compact notation, whereas 
using a full-rank 3 × 3 matrix (which can be obtained from the 2 × 3 matrix by appending a 
$[\mathbf{0}^T \; 1]$ row) makes it possible to chain transformations using matrix multiplication as well as 
to compute inverse transforms. Note that in any equation where an augmented vector such as 
$\bar{\mathbf{x}}$ appears on both sides, it can always be replaced with a full homogeneous vector $\tilde{\mathbf{x}}$. 
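The benefit of the full-rank 3 × 3 form can be seen in a short sketch. The helpers `matmul`, `translation`, and `apply` are written here for illustration with nested lists; in practice a linear algebra library would normally be used instead.

```python
def matmul(A, B):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def translation(tx, ty):
    """Full-rank 3 x 3 translation matrix [[I, t], [0^T, 1]] (Equation 2.15)."""
    return [[1.0, 0.0, tx],
            [0.0, 1.0, ty],
            [0.0, 0.0, 1.0]]


def apply(M, point):
    """Apply a 3 x 3 matrix whose last row is [0, 0, 1] to a 2D point."""
    x, y = point
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])


# Chaining two translations is a single matrix product ...
chained = matmul(translation(1.0, 2.0), translation(3.0, 4.0))
# ... and the inverse of a translation simply negates t.
identity = matmul(translation(1.0, 2.0), translation(-1.0, -2.0))
```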


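Rotation + translation. This transformation is also known as 2D rigid body motion or the 
2D Euclidean transformation (since Euclidean distances are preserved). It can be written as 
$\mathbf{x}' = \mathbf{R}\mathbf{x} + \mathbf{t}$ or 

$$\mathbf{x}' = \begin{bmatrix} \mathbf{R} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}}, \tag{2.16}$$

where 

$$\mathbf{R} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \tag{2.17}$$

is an orthonormal rotation matrix with $\mathbf{R}\mathbf{R}^T = \mathbf{I}$ and $|\mathbf{R}| = 1$. 

Scaled rotation. Also known as the similarity transform, this transformation can be ex- 
pressed as $\mathbf{x}' = s\mathbf{R}\mathbf{x} + \mathbf{t}$, where $s$ is an arbitrary scale factor. It can also be written as 

$$\mathbf{x}' = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \end{bmatrix} \bar{\mathbf{x}} = \begin{bmatrix} a & -b & t_x \\ b & a & t_y \end{bmatrix} \bar{\mathbf{x}}, \tag{2.18}$$

where we no longer require that $a^2 + b^2 = 1$. The similarity transform preserves angles 
between lines. 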
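The rigid and similarity transforms differ only in the scale factor, which the following sketch makes explicit. The function names, the 30° angle, and the test points are arbitrary choices for illustration.

```python
import math


def similarity(s, theta, tx, ty):
    """2 x 3 matrix [sR t] of Equation (2.18); s = 1 gives a rigid (Euclidean) transform."""
    a, b = s * math.cos(theta), s * math.sin(theta)
    return [[a, -b, tx],
            [b, a, ty]]


def apply(M, point):
    """x' = M x_bar, where x_bar = (x, y, 1) is the augmented vector."""
    x, y = point
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])


def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


p, q = (1.0, 0.0), (4.0, 4.0)  # two points that are 5 units apart
rigid = similarity(1.0, math.radians(30.0), 2.0, -1.0)
scaled = similarity(2.0, math.radians(30.0), 2.0, -1.0)

d_rigid = distance(apply(rigid, p), apply(rigid, q))     # lengths are preserved
d_scaled = distance(apply(scaled, p), apply(scaled, q))  # lengths are scaled by s
```

Up to floating-point rounding, `d_rigid` stays at 5 while `d_scaled` becomes 10, which is why lengths appear in the "preserves" list only for the rigid case.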


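Affine. The affine transformation is written as $\mathbf{x}' = \mathbf{A}\bar{\mathbf{x}}$, where $\mathbf{A}$ is an arbitrary 2 × 3 
matrix, i.e., 

$$\mathbf{x}' = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix} \bar{\mathbf{x}}. \tag{2.19}$$

Parallel lines remain parallel under affine transformations. 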


Projective. This transformation, also known as a perspective transform or homography, 
operates on homogeneous coordinates, 

$$\tilde{\mathbf{x}}' = \tilde{\mathbf{H}}\tilde{\mathbf{x}}, \tag{2.20}$$

where $\tilde{\mathbf{H}}$ is an arbitrary 3 × 3 matrix. Note that $\tilde{\mathbf{H}}$ is homogeneous, i.e., it is only defined 
up to a scale, and that two $\tilde{\mathbf{H}}$ matrices that differ only by scale are equivalent. The resulting 
homogeneous coordinate $\tilde{\mathbf{x}}'$ must be normalized in order to obtain an inhomogeneous result 
$\mathbf{x}$, i.e., 

$$x' = \frac{h_{00}x + h_{01}y + h_{02}}{h_{20}x + h_{21}y + h_{22}} \quad \text{and} \quad y' = \frac{h_{10}x + h_{11}y + h_{12}}{h_{20}x + h_{21}y + h_{22}}. \tag{2.21}$$

Perspective transformations preserve straight lines (i.e., they remain straight after the trans- 
formation). 
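A direct transcription of Equation (2.21) is sketched below. The function name `apply_homography` and the example matrix are assumptions for illustration; the second matrix is the first multiplied by 4, to show that the scale of the homography does not affect the result.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3 x 3 homography and normalize (Equation 2.21)."""
    x, y = point
    xt = H[0][0] * x + H[0][1] * y + H[0][2]
    yt = H[1][0] * x + H[1][1] * y + H[1][2]
    wt = H[2][0] * x + H[2][1] * y + H[2][2]
    if wt == 0.0:
        raise ValueError("point maps to a point at infinity")
    return (xt / wt, yt / wt)


H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0]]
H_scaled = [[4.0 * h for h in row] for row in H]  # same homography up to scale

mapped = apply_homography(H, (1.0, 1.0))
mapped_scaled = apply_homography(H_scaled, (1.0, 1.0))
```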


Hierarchy of 2D transformations. The preceding set of transformations are illustrated in 
Figure 2.4 and summarized in Table 2.1. The easiest way to think of them is as a set of 
(potentially restricted) 3 x 3 matrices operating on 2D homogeneous coordinate vectors. 
Hartley and Zisserman (2004) contains a more detailed description of the hierarchy of 2D 
planar transformations. 

The above transformations form a nested set of groups, i.e., they are closed under compo- 
sition and have an inverse that is a member of the same group. (This will be important later 
when applying these transformations to images in Section 3.6.) Each (simpler) group is a 
subgroup of the more complex group below it. The mathematics of such Lie groups and their 
related algebras (tangent spaces at the origin) are discussed in a number of recent robotics 
tutorials (Dellaert and Kaess 2017; Blanco 2019; Solà, Deray, and Atchuthan 2019), where 
the 2D rotation and rigid transforms are called SO(2) and SE(2), which stand for the special 
orthogonal and special Euclidean groups.¹ 

Transformation       Matrix                                            # DoF   Preserves
translation          $[\mathbf{I} \;|\; \mathbf{t}]_{2 \times 3}$      2       orientation
rigid (Euclidean)    $[\mathbf{R} \;|\; \mathbf{t}]_{2 \times 3}$      3       lengths
similarity           $[s\mathbf{R} \;|\; \mathbf{t}]_{2 \times 3}$     4       angles
affine               $[\mathbf{A}]_{2 \times 3}$                       6       parallelism
projective           $[\tilde{\mathbf{H}}]_{3 \times 3}$               8       straight lines

Table 2.1 Hierarchy of 2D coordinate transformations, listing the transformation name, its 
matrix form, the number of degrees of freedom, what geometric properties it preserves, and 
a mnemonic icon. Each transformation also preserves the properties listed in the rows below 
it, i.e., similarity preserves not only angles but also parallelism and straight lines. The 2 × 
3 matrices are extended with a third $[\mathbf{0}^T \; 1]$ row to form a full 3 × 3 matrix for homogeneous 
coordinate transformations. 


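Co-vectors. While the above transformations can be used to transform points in a 2D plane, 
can they also be used directly to transform a line equation? Consider the homogeneous equa- 
tion $\tilde{\mathbf{l}} \cdot \tilde{\mathbf{x}} = 0$. If we transform $\tilde{\mathbf{x}}' = \tilde{\mathbf{H}}\tilde{\mathbf{x}}$, we obtain 

$$\tilde{\mathbf{l}}' \cdot \tilde{\mathbf{x}}' = \tilde{\mathbf{l}}'^T \tilde{\mathbf{H}} \tilde{\mathbf{x}} = (\tilde{\mathbf{H}}^T \tilde{\mathbf{l}}')^T \tilde{\mathbf{x}} = \tilde{\mathbf{l}} \cdot \tilde{\mathbf{x}} = 0, \tag{2.22}$$

i.e., $\tilde{\mathbf{l}}' = \tilde{\mathbf{H}}^{-T}\tilde{\mathbf{l}}$. Thus, the action of a projective transformation on a co-vector such as a 2D 
line or 3D normal can be represented by the transposed inverse of the matrix, which is equiv- 
alent to the adjoint of $\tilde{\mathbf{H}}$, since projective transformation matrices are homogeneous. Jim 
Blinn (1998) describes (in Chapters 9 and 10) the ins and outs of notating and manipulating 
co-vectors. 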
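The following sketch checks Equation (2.22) numerically. Rather than inverting the matrix, it uses the cofactor matrix of $\tilde{\mathbf{H}}$, which equals $|\tilde{\mathbf{H}}|\,\tilde{\mathbf{H}}^{-T}$ and therefore represents the same line transformation up to scale. The helper names and the example matrix are assumptions chosen so that the arithmetic is exact.

```python
def cofactor_matrix(H):
    """Cofactor matrix of a 3 x 3 matrix, equal to det(H) times the transposed inverse."""
    return [[H[(i + 1) % 3][(j + 1) % 3] * H[(i + 2) % 3][(j + 2) % 3]
             - H[(i + 1) % 3][(j + 2) % 3] * H[(i + 2) % 3][(j + 1) % 3]
             for j in range(3)] for i in range(3)]


def matvec(M, v):
    """Product of a 3 x 3 matrix and a 3-vector."""
    return tuple(sum(M[i][k] * v[k] for k in range(3)) for i in range(3))


def dot(a, b):
    """Dot product of two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


H = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [1.0, 0.0, 1.0]]
line = (-1.0, 1.0, 0.0)                  # the line y = x
p, q = (1.0, 1.0, 1.0), (2.0, 2.0, 1.0)  # two homogeneous points on that line

# Points transform with H; the line transforms with the transposed inverse (up to scale).
new_line = matvec(cofactor_matrix(H), line)
new_p, new_q = matvec(H, p), matvec(H, q)
```

The transformed points still satisfy the transformed line equation, i.e., `dot(new_line, new_p)` and `dot(new_line, new_q)` are both zero, whereas applying $\tilde{\mathbf{H}}$ itself to the line would in general not preserve incidence.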


1 The term special refers to the desired condition of no reflection, i.e., $\det \mathbf{R} = 1$. 

While the above transformations are the ones we use most extensively, a number of addi- 
tional transformations are sometimes used. 


Stretch/squash. This transformation changes the aspect ratio of an image, 

$$x' = s_x x + t_x$$
$$y' = s_y y + t_y,$$

and is a restricted form of an affine transformation. Unfortunately, it does not nest cleanly 
with the groups listed in Table 2.1. 


Planar surface flow. This eight-parameter transformation (Horn 1986; Bergen, Anandan et 
al. 1992; Girod, Greiner, and Niemann 2000), 

$$x' = a_0 + a_1 x + a_2 y + a_6 x^2 + a_7 xy$$
$$y' = a_3 + a_4 x + a_5 y + a_6 xy + a_7 y^2,$$

arises when a planar surface undergoes a small 3D motion. It can thus be thought of as a 
small motion approximation to a full homography. Its main attraction is that it is linear in the 
motion parameters, $a_k$, which are often the quantities being estimated. 


Bilinear interpolant. This eight-parameter transform (Wolberg 1990),

$$ x' = a_0 + a_1 x + a_2 y + a_6 x y $$
$$ y' = a_3 + a_4 x + a_5 y + a_7 x y, $$

can be used to interpolate the deformation due to the motion of the four corner points of
a square. (In fact, it can interpolate the motion of any four non-collinear points.) While
the deformation is linear in the motion parameters, it does not generally preserve straight
lines (only lines parallel to the square axes). However, it is often quite useful, e.g., in the
interpolation of sparse grids using splines (Section 9.2.2).
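
Because the bilinear interpolant is linear in its parameters, the eight coefficients can be recovered from four point correspondences with two small linear solves. The sketch below is one way to do this in Python with NumPy; the function names and the corner displacements are assumptions made for illustration.

```python
import numpy as np

def _basis(pts):
    """Rows of the bilinear basis [1, x, y, xy] for an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    return np.stack([np.ones(len(pts)), x, y, x * y], axis=1)

def fit_bilinear(src, dst):
    """Solve for the coefficients that map four source points onto four destinations."""
    B = _basis(src)
    dst = np.asarray(dst, dtype=float)
    cx = np.linalg.solve(B, dst[:, 0])   # (a0, a1, a2, a6)
    cy = np.linalg.solve(B, dst[:, 1])   # (a3, a4, a5, a7)
    return cx, cy

def apply_bilinear(cx, cy, pts):
    """Evaluate x' and y' at an (N, 2) array of points."""
    B = _basis(pts)
    return np.stack([B @ cx, B @ cy], axis=1)

# Corners of the unit square and hypothetical displaced positions.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dst = np.array([[0.1, -0.2], [1.3, 0.1], [-0.1, 1.2], [1.5, 1.6]])
cx, cy = fit_bilinear(src, dst)
center = apply_bilinear(cx, cy, [[0.5, 0.5]])
print(center)
```

For the unit square, the center maps to the average of the four displaced corners, which is a convenient sanity check.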


2.1.2 3D transformations 


The set of three-dimensional coordinate transformations is very similar to that available for
2D transformations and is summarized in Table 2.2. As in 2D, these transformations form a
nested set of groups. Hartley and Zisserman (2004, Section 2.4) give a more detailed
description of this hierarchy.

Transformation       Matrix                                          # DoF   Preserves
translation          $[\mathbf{I} \;\; \mathbf{t}]_{3 \times 4}$     3       orientation
rigid (Euclidean)    $[\mathbf{R} \;\; \mathbf{t}]_{3 \times 4}$     6       lengths
similarity           $[s\mathbf{R} \;\; \mathbf{t}]_{3 \times 4}$    7       angles
affine               $[\mathbf{A}]_{3 \times 4}$                     12      parallelism
projective           $[\tilde{\mathbf{H}}]_{4 \times 4}$             15      straight lines

Table 2.2 Hierarchy of 3D coordinate transformations. Each transformation also preserves
the properties listed in the rows below it, i.e., similarity preserves not only angles but
also parallelism and straight lines. The 3 × 4 matrices are extended with a fourth
$[\mathbf{0}^T \; 1]$ row to form a full 4 × 4 matrix for homogeneous coordinate
transformations. The mnemonic icons are drawn in 2D but are meant to suggest transformations
occurring in a full 3D cube.


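Translation. 3D translations can be written as $\mathbf{x}' = \mathbf{x} + \mathbf{t}$ or

$$ \mathbf{x}' = [\mathbf{I} \;\; \mathbf{t}] \, \bar{\mathbf{x}}, \tag{2.23} $$

where $\mathbf{I}$ is the (3 × 3) identity matrix.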


Rotation + translation. Also known as 3D rigid body motion or the 3D Euclidean
transformation or SE(3), it can be written as $\mathbf{x}' = \mathbf{R}\mathbf{x} + \mathbf{t}$ or

$$ \mathbf{x}' = [\mathbf{R} \;\; \mathbf{t}] \, \bar{\mathbf{x}}, \tag{2.24} $$

where $\mathbf{R}$ is a 3 × 3 orthonormal rotation matrix with
$\mathbf{R}\mathbf{R}^T = \mathbf{I}$ and $|\mathbf{R}| = 1$. Note that sometimes it is
more convenient to describe a rigid motion using

$$ \mathbf{x}' = \mathbf{R}(\mathbf{x} - \mathbf{c}) = \mathbf{R}\mathbf{x} - \mathbf{R}\mathbf{c}, \tag{2.25} $$

where $\mathbf{c}$ is the center of rotation (often the camera center).
Compactly parameterizing a 3D rotation is a non-trivial task, which we describe in more
detail below.

Scaled rotation. The 3D similarity transform can be expressed as
$\mathbf{x}' = s\mathbf{R}\mathbf{x} + \mathbf{t}$ where $s$ is an arbitrary scale factor.
It can also be written as

$$ \mathbf{x}' = [s\mathbf{R} \;\; \mathbf{t}] \, \bar{\mathbf{x}}. \tag{2.26} $$

This transformation preserves angles between lines and planes.


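Affine. The affine transform is written as $\mathbf{x}' = \mathbf{A}\bar{\mathbf{x}}$, where
$\mathbf{A}$ is an arbitrary 3 × 4 matrix, i.e.,

$$ \mathbf{x}' = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \end{bmatrix} \bar{\mathbf{x}}. \tag{2.27} $$

Parallel lines and planes remain parallel under affine transformations.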


Projective. This transformation, variously known as a 3D perspective transform,
homography, or collineation, operates on homogeneous coordinates,

$$ \tilde{\mathbf{x}}' = \tilde{\mathbf{H}} \tilde{\mathbf{x}}, \tag{2.28} $$

where $\tilde{\mathbf{H}}$ is an arbitrary 4 × 4 homogeneous matrix. As in 2D, the resulting
homogeneous coordinate $\tilde{\mathbf{x}}'$ must be normalized in order to obtain an
inhomogeneous result $\mathbf{x}'$. Perspective transformations preserve straight lines
(i.e., they remain straight after the transformation).
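
The invariants listed in Table 2.2 are easy to check numerically. The following sketch, in Python with NumPy, builds the 4 × 4 forms of a rigid and a similarity transform and applies them to three points that form a right angle. The helper names (`make_transform`, `apply_transform`) and the numeric values are assumptions chosen for illustration.

```python
import numpy as np

def make_transform(R, t, s=1.0):
    """Return the 4x4 matrix [[sR, t], [0^T, 1]] (rigid when s == 1, similarity otherwise)."""
    T = np.eye(4)
    T[:3, :3] = s * R
    T[:3, 3] = t
    return T

def apply_transform(T, pts):
    """Apply a 4x4 transform to an (N, 3) array and normalize by the last coordinate."""
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    qh = ph @ T.T
    return qh[:, :3] / qh[:, 3:4]

theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
t = np.array([1.0, -2.0, 0.5])
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])

rigid = apply_transform(make_transform(Rz, t), pts)
simil = apply_transform(make_transform(Rz, t, s=3.0), pts)

# Equivalent center-of-rotation form (2.25): x' = R (x - c) with c = -R^T t.
c = -Rz.T @ t
rigid_centered = (pts - c) @ Rz.T
```

The rigid transform keeps the unit edge at length 1, while the similarity scales it to 3 but keeps the two edges perpendicular.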


2.1.3 3D rotations 


The biggest difference between 2D and 3D coordinate transformations is that the parameter- 
ization of the 3D rotation matrix R is not as straightforward, as several different possibilities 
exist. 


Euler angles 


A rotation matrix can be formed as the product of three rotations around three cardinal axes,
e.g., x, y, and z, or x, y, and x. This is generally a bad idea, as the result depends on the
order in which the transforms are applied.² What is worse, it is not always possible to move
smoothly in the parameter space, i.e., sometimes one or more of the Euler angles change
dramatically in response to a small change in rotation.³ For these reasons, we do not even
give the formula for Euler angles in this book; interested readers can look in other textbooks
or technical reports (Faugeras 1993; Diebel 2006). Note that, in some applications, if the
rotations are known to be a set of uni-axial transforms, they can always be represented using
an explicit set of rigid transformations.

² However, in special situations, such as describing the motion of a pan-tilt head, these
angles may be more intuitive.

³ In robotics, this is sometimes referred to as gimbal lock.

Figure 2.5 Rotation around an axis $\hat{\mathbf{n}}$ by an angle $\theta$.


Axis/angle (exponential twist) 


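A rotation can be represented by a rotation axis $\hat{\mathbf{n}}$ and an angle $\theta$, or
equivalently by a 3D vector $\boldsymbol{\omega} = \theta\hat{\mathbf{n}}$. Figure 2.5 shows
how we can compute the equivalent rotation. First, we project the vector $\mathbf{v}$ onto the
axis $\hat{\mathbf{n}}$ to obtain

$$ \mathbf{v}_\parallel = \hat{\mathbf{n}}(\hat{\mathbf{n}} \cdot \mathbf{v}) = (\hat{\mathbf{n}}\hat{\mathbf{n}}^T)\mathbf{v}, \tag{2.29} $$

which is the component of $\mathbf{v}$ that is not affected by the rotation. Next, we compute
the perpendicular residual of $\mathbf{v}$ from $\hat{\mathbf{n}}$,

$$ \mathbf{v}_\perp = \mathbf{v} - \mathbf{v}_\parallel = (\mathbf{I} - \hat{\mathbf{n}}\hat{\mathbf{n}}^T)\mathbf{v}. \tag{2.30} $$

We can rotate this vector by 90° using the cross product,

$$ \mathbf{v}_\times = \hat{\mathbf{n}} \times \mathbf{v}_\perp = \hat{\mathbf{n}} \times \mathbf{v} = [\hat{\mathbf{n}}]_\times \mathbf{v}, \tag{2.31} $$

where $[\hat{\mathbf{n}}]_\times$ is the matrix form of the cross product operator with the
vector $\hat{\mathbf{n}} = (\hat{n}_x, \hat{n}_y, \hat{n}_z)$,

$$ [\hat{\mathbf{n}}]_\times = \begin{bmatrix} 0 & -\hat{n}_z & \hat{n}_y \\ \hat{n}_z & 0 & -\hat{n}_x \\ -\hat{n}_y & \hat{n}_x & 0 \end{bmatrix}. \tag{2.32} $$

Note that rotating this vector by another 90° is equivalent to taking the cross product again,

$$ \mathbf{v}_{\times\times} = \hat{\mathbf{n}} \times \mathbf{v}_\times = [\hat{\mathbf{n}}]_\times^2 \mathbf{v} = -\mathbf{v}_\perp, $$

and hence

$$ \mathbf{v}_\parallel = \mathbf{v} - \mathbf{v}_\perp = \mathbf{v} + \mathbf{v}_{\times\times} = (\mathbf{I} + [\hat{\mathbf{n}}]_\times^2)\mathbf{v}. $$

We can now compute the in-plane component of the rotated vector $\mathbf{u}$ as

$$ \mathbf{u}_\perp = \cos\theta \, \mathbf{v}_\perp + \sin\theta \, \mathbf{v}_\times = (\sin\theta \, [\hat{\mathbf{n}}]_\times - \cos\theta \, [\hat{\mathbf{n}}]_\times^2)\mathbf{v}. $$

Putting all these terms together, we obtain the final rotated vector as

$$ \mathbf{u} = \mathbf{u}_\perp + \mathbf{v}_\parallel = (\mathbf{I} + \sin\theta \, [\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2)\mathbf{v}. \tag{2.33} $$

We can therefore write the rotation matrix corresponding to a rotation by $\theta$ around an
axis $\hat{\mathbf{n}}$ as

$$ \mathbf{R}(\hat{\mathbf{n}}, \theta) = \mathbf{I} + \sin\theta \, [\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2, \tag{2.34} $$

which is known as Rodrigues’ formula (Ayache 1989).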
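
Rodrigues’ formula (2.34) translates almost line for line into code. The following is a minimal sketch in Python with NumPy; the function names `skew` and `rodrigues` are illustrative choices, and the axis is normalized inside the function as a convenience that the formula itself does not require.

```python
import numpy as np

def skew(n):
    """Cross-product matrix [n]_x from (2.32)."""
    x, y, z = n
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def rodrigues(axis, theta):
    """Rotation matrix for a rotation by theta (radians) about the given axis, per (2.34)."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)          # assumed convenience: accept unnormalized axes
    N = skew(n)
    return np.eye(3) + np.sin(theta) * N + (1.0 - np.cos(theta)) * (N @ N)

# A 90-degree rotation about the z-axis takes the x-axis to the y-axis.
R = rodrigues([0.0, 0.0, 1.0], np.pi / 2)
u = R @ np.array([1.0, 0.0, 0.0])
print(np.round(u, 6))
```

The result can be checked against the rotation-matrix conditions stated for (2.24): the product of the matrix with its transpose is the identity and the determinant is 1.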

The product of the axis $\hat{\mathbf{n}}$ and angle $\theta$,
$\boldsymbol{\omega} = \theta\hat{\mathbf{n}} = (\omega_x, \omega_y, \omega_z)$, is a minimal
representation for a 3D rotation. Rotations through common angles such as multiples of 90° can
be represented exactly (and converted to exact matrices) if $\theta$ is stored in degrees.
Unfortunately, this representation is not unique, since we can always add a multiple of 360°
($2\pi$ radians) to $\theta$ and get the same rotation matrix. As well,
$(\hat{\mathbf{n}}, \theta)$ and $(-\hat{\mathbf{n}}, -\theta)$ represent the same rotation.

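However, for small rotations (e.g., corrections to rotations), this is an excellent choice.
In particular, for small (infinitesimal or instantaneous) rotations and $\theta$ expressed in
radians, Rodrigues’ formula simplifies to

$$ \mathbf{R}(\boldsymbol{\omega}) \approx \mathbf{I} + \sin\theta \, [\hat{\mathbf{n}}]_\times \approx \mathbf{I} + [\theta\hat{\mathbf{n}}]_\times = \begin{bmatrix} 1 & -\omega_z & \omega_y \\ \omega_z & 1 & -\omega_x \\ -\omega_y & \omega_x & 1 \end{bmatrix}, \tag{2.35} $$

which gives a nice linearized relationship between the rotation parameters
$\boldsymbol{\omega}$ and $\mathbf{R}$. We can also write
$\mathbf{R}(\boldsymbol{\omega})\mathbf{v} \approx \mathbf{v} + \boldsymbol{\omega} \times \mathbf{v}$,
which is handy when we want to compute the derivative of $\mathbf{R}\mathbf{v}$ with respect
to $\boldsymbol{\omega}$,

$$ \frac{\partial \mathbf{R}\mathbf{v}}{\partial \boldsymbol{\omega}^T} = -[\mathbf{v}]_\times = \begin{bmatrix} 0 & z & -y \\ -z & 0 & x \\ y & -x & 0 \end{bmatrix}. \tag{2.36} $$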
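
The Jacobian in (2.36) can be checked against central finite differences of the full Rodrigues rotation evaluated around $\boldsymbol{\omega} = 0$. The sketch below, in Python with NumPy, does this for one test vector. The function names, the step size `eps`, and the test vector are assumptions.

```python
import numpy as np

def skew(a):
    """Cross-product matrix [a]_x."""
    x, y, z = a
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def rotation_from_omega(w):
    """Rodrigues' formula for the rotation vector w = theta * n (theta in radians)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:                  # fall back to the linearized form (2.35)
        return np.eye(3) + skew(w)
    N = skew(w / theta)
    return np.eye(3) + np.sin(theta) * N + (1.0 - np.cos(theta)) * (N @ N)

v = np.array([0.3, -1.2, 2.0])         # hypothetical test vector
eps = 1e-6

# Central differences of R(w) v with respect to each component of w, at w = 0.
J = np.zeros((3, 3))
for k in range(3):
    dw = np.zeros(3)
    dw[k] = eps
    J[:, k] = (rotation_from_omega(dw) @ v - rotation_from_omega(-dw) @ v) / (2 * eps)

print(np.round(J, 6))
print(-skew(v))
```

The two printed matrices should agree to within the finite-difference error.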


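Another way to derive a rotation through a finite angle is called the exponential twist
(Murray, Li, and Sastry 1994). A rotation by an angle $\theta$ is equivalent to $k$ rotations
through $\theta/k$. In the limit as $k \to \infty$, we obtain

$$ \mathbf{R}(\hat{\mathbf{n}}, \theta) = \lim_{k \to \infty} \left(\mathbf{I} + \frac{1}{k}[\theta\hat{\mathbf{n}}]_\times\right)^k = \exp [\boldsymbol{\omega}]_\times. \tag{2.37} $$

If we expand the matrix exponential as a Taylor series (using the identity
$[\hat{\mathbf{n}}]_\times^{k+2} = -[\hat{\mathbf{n}}]_\times^k$, $k > 0$, and again assuming
$\theta$ is in radians),

$$ \begin{aligned} \exp [\boldsymbol{\omega}]_\times &= \mathbf{I} + \theta[\hat{\mathbf{n}}]_\times + \frac{\theta^2}{2}[\hat{\mathbf{n}}]_\times^2 + \frac{\theta^3}{3!}[\hat{\mathbf{n}}]_\times^3 + \cdots \\ &= \mathbf{I} + \left(\theta - \frac{\theta^3}{3!} + \cdots\right)[\hat{\mathbf{n}}]_\times + \left(\frac{\theta^2}{2} - \frac{\theta^4}{4!} + \cdots\right)[\hat{\mathbf{n}}]_\times^2 \\ &= \mathbf{I} + \sin\theta \, [\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2, \end{aligned} \tag{2.38} $$

which yields the familiar Rodrigues’ formula.

Figure 2.6 Unit quaternions live on the unit sphere $\|\mathbf{q}\| = 1$. This figure shows a
smooth trajectory through the three quaternions $\mathbf{q}_0$, $\mathbf{q}_1$, and
$\mathbf{q}_2$. The antipodal point to $\mathbf{q}_2$, namely $-\mathbf{q}_2$, represents the
same rotation as $\mathbf{q}_2$.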
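
Both the limit in (2.37) and the series in (2.38) can be evaluated numerically and compared with the closed form. The sketch below uses Python with NumPy. The truncation length, the value of `k`, and the axis and angle are assumptions chosen so that the comparison is easy to run.

```python
import numpy as np

def skew(a):
    """Cross-product matrix [a]_x."""
    x, y, z = a
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def exp_series(A, terms=30):
    """Truncated Taylor series I + A + A^2/2! + ... for the matrix exponential."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

n = np.array([1.0, 2.0, 2.0]) / 3.0    # unit axis (assumed example)
theta = 1.1                            # radians
N = skew(n)

R_closed = np.eye(3) + np.sin(theta) * N + (1.0 - np.cos(theta)) * (N @ N)
R_series = exp_series(theta * N)

# Many tiny rotations composed together, as in the limit (2.37).
k = 2 ** 20
R_limit = np.linalg.matrix_power(np.eye(3) + theta * N / k, k)

print(np.abs(R_series - R_closed).max(), np.abs(R_limit - R_closed).max())
```

The series converges to machine precision after a few dozen terms, whereas the product of small rotations approaches the same matrix only at a rate of roughly $1/k$.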

In robotics (and group theory), rotations are called SO(3), i.e., the special orthogonal
group in 3D. The incremental rotations $\boldsymbol{\omega}$ are associated with a Lie algebra
so(3) and are the preferred way to formulate rotation derivatives and to model uncertainties
in rotation estimates (Blanco 2019; Solà, Deray, and Atchuthan 2019).


Unit quaternions 


The unit quaternion representation is closely related to the angle/axis representation. A unit
quaternion is a unit length 4-vector whose components can be written as
$\mathbf{q} = (q_x, q_y, q_z, q_w)$ or $\mathbf{q} = (x, y, z, w)$ for short. Unit quaternions
live on the unit sphere $\|\mathbf{q}\| = 1$ and antipodal (opposite sign) quaternions,
$\mathbf{q}$ and $-\mathbf{q}$, represent the same rotation (Figure 2.6). Other than
this ambiguity (dual covering), the unit quaternion representation of a rotation is unique.
Furthermore, the representation is continuous, i.e., as rotation matrices vary continuously,
you can find a continuous quaternion representation, although the path on the quaternion
sphere may wrap all the way around before returning to the “origin”
$\mathbf{q}_0 = (0, 0, 0, 1)$. For these and other reasons given below, quaternions are a very
popular representation for pose and for pose interpolation in computer graphics
(Shoemake 1985).


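Quaternions can be derived from the axis/angle representation through the formula

$$ \mathbf{q} = (\mathbf{v}, w) = \left(\sin\frac{\theta}{2}\,\hat{\mathbf{n}}, \cos\frac{\theta}{2}\right), \tag{2.39} $$

where $\hat{\mathbf{n}}$ and $\theta$ are the rotation axis and angle. Using the trigonometric
identities $\sin\theta = 2\sin\frac{\theta}{2}\cos\frac{\theta}{2}$ and
$(1 - \cos\theta) = 2\sin^2\frac{\theta}{2}$, Rodrigues’ formula can be converted to

$$ \begin{aligned} \mathbf{R}(\hat{\mathbf{n}}, \theta) &= \mathbf{I} + \sin\theta \, [\hat{\mathbf{n}}]_\times + (1 - \cos\theta)[\hat{\mathbf{n}}]_\times^2 \\ &= \mathbf{I} + 2w[\mathbf{v}]_\times + 2[\mathbf{v}]_\times^2. \end{aligned} \tag{2.40} $$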


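This suggests a quick way to rotate a vector $\mathbf{v}$ by a quaternion using a series of
cross products, scalings, and additions. To obtain a formula for $\mathbf{R}(\mathbf{q})$ as a
function of $(x, y, z, w)$, recall that

$$ [\mathbf{v}]_\times = \begin{bmatrix} 0 & -z & y \\ z & 0 & -x \\ -y & x & 0 \end{bmatrix} \quad \text{and} \quad [\mathbf{v}]_\times^2 = \begin{bmatrix} -y^2 - z^2 & xy & xz \\ xy & -x^2 - z^2 & yz \\ xz & yz & -x^2 - y^2 \end{bmatrix}. $$

We thus obtain

$$ \mathbf{R}(\mathbf{q}) = \begin{bmatrix} 1 - 2(y^2 + z^2) & 2(xy - zw) & 2(xz + yw) \\ 2(xy + zw) & 1 - 2(x^2 + z^2) & 2(yz - xw) \\ 2(xz - yw) & 2(yz + xw) & 1 - 2(x^2 + y^2) \end{bmatrix}. \tag{2.41} $$

The diagonal terms can be made more symmetrical by replacing $1 - 2(y^2 + z^2)$ with
$(x^2 + w^2 - y^2 - z^2)$, etc.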
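
Equations (2.39) and (2.41) can be written down directly. The sketch below, in Python with NumPy, stores quaternions in the $(x, y, z, w)$ order used in the text; the function names are illustrative assumptions.

```python
import numpy as np

def quat_from_axis_angle(axis, theta):
    """Unit quaternion (x, y, z, w) for a rotation by theta (radians) about axis, per (2.39)."""
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    return np.append(np.sin(theta / 2.0) * n, np.cos(theta / 2.0))

def quat_to_matrix(q):
    """Rotation matrix R(q) from (2.41); q is assumed to have unit length."""
    x, y, z, w = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ])

q = quat_from_axis_angle([0.0, 0.0, 1.0], np.pi / 2)   # 90 degrees about z
R = quat_to_matrix(q)
print(np.round(R, 6))
```

Because every entry of (2.41) is quadratic in the quaternion components, negating `q` leaves `R` unchanged, which is the antipodal ambiguity mentioned above.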

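The nicest aspect of unit quaternions is that there is a simple algebra for composing
rotations expressed as unit quaternions. Given two quaternions
$\mathbf{q}_0 = (\mathbf{v}_0, w_0)$ and $\mathbf{q}_1 = (\mathbf{v}_1, w_1)$, the quaternion
multiply operator is defined as

$$ \mathbf{q}_2 = \mathbf{q}_0 \mathbf{q}_1 = (\mathbf{v}_0 \times \mathbf{v}_1 + w_0 \mathbf{v}_1 + w_1 \mathbf{v}_0, \; w_0 w_1 - \mathbf{v}_0 \cdot \mathbf{v}_1), \tag{2.42} $$

with the property that
$\mathbf{R}(\mathbf{q}_2) = \mathbf{R}(\mathbf{q}_0)\mathbf{R}(\mathbf{q}_1)$. Note that
quaternion multiplication is not commutative, just as 3D rotations and matrix multiplications
are not.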
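
The multiplication rule (2.42) and its composition property can be verified numerically. The following Python/NumPy sketch uses $(x, y, z, w)$ ordering; the function names and the two test rotations are assumptions.

```python
import numpy as np

def quat_mul(q0, q1):
    """Quaternion product q0 q1 from (2.42), with components ordered (x, y, z, w)."""
    v0, w0 = q0[:3], q0[3]
    v1, w1 = q1[:3], q1[3]
    v = np.cross(v0, v1) + w0 * v1 + w1 * v0
    w = w0 * w1 - np.dot(v0, v1)
    return np.append(v, w)

def quat_to_matrix(q):
    """Rotation matrix R(q) from (2.41)."""
    x, y, z, w = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ])

s = np.sin(np.pi / 4)
c = np.cos(np.pi / 4)
qx = np.array([s, 0.0, 0.0, c])   # 90 degrees about x
qy = np.array([0.0, s, 0.0, c])   # 90 degrees about y

q_xy = quat_mul(qx, qy)
q_yx = quat_mul(qy, qx)
print(q_xy, q_yx)                 # different: the product is not commutative
```

Only the sign of the cross-product term changes when the operands are swapped, which is exactly where the non-commutativity comes from.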

Taking the inverse of a quaternion is easy: Just flip the sign of $\mathbf{v}$ or $w$ (but not
both!). (You can verify this has the desired effect of transposing the $\mathbf{R}$ matrix in
(2.41).) Thus, we can also define quaternion division as

$$ \mathbf{q}_2 = \mathbf{q}_0 / \mathbf{q}_1 = \mathbf{q}_0 \mathbf{q}_1^{-1} = (\mathbf{v}_0 \times \mathbf{v}_1 + w_0 \mathbf{v}_1 - w_1 \mathbf{v}_0, \; -w_0 w_1 - \mathbf{v}_0 \cdot \mathbf{v}_1). \tag{2.43} $$

This is useful when the incremental rotation between two rotations is desired.

procedure slerp($\mathbf{q}_0$, $\mathbf{q}_1$, $\alpha$):
1. $\mathbf{q}_r = \mathbf{q}_1 / \mathbf{q}_0 = (\mathbf{v}_r, w_r)$
2. if $w_r < 0$ then $\mathbf{q}_r \leftarrow -\mathbf{q}_r$
3. $\theta_r = 2 \tan^{-1}(\|\mathbf{v}_r\| / w_r)$
4. $\hat{\mathbf{n}}_r = \mathcal{N}(\mathbf{v}_r) = \mathbf{v}_r / \|\mathbf{v}_r\|$
5. $\theta_\alpha = \alpha \, \theta_r$
6. $\mathbf{q}_\alpha = (\sin\frac{\theta_\alpha}{2}\,\hat{\mathbf{n}}_r, \cos\frac{\theta_\alpha}{2})$
7. return $\mathbf{q}_2 = \mathbf{q}_\alpha \mathbf{q}_0$

Algorithm 2.1 Spherical linear interpolation (slerp). The axis and total angle are first
computed from the quaternion ratio. (This computation can be lifted outside an inner loop
that generates a set of interpolated positions for animation.) An incremental quaternion is
then computed and multiplied by the starting rotation quaternion.

In particular, if we want to determine a rotation that is partway between two given
rotations, we can compute the incremental rotation, take a fraction of the angle, and compute
the new rotation. This procedure is called spherical linear interpolation or slerp for short
(Shoemake 1985) and is given in Algorithm 2.1. Note that Shoemake presents two formulas other
than the one given here. The first exponentiates $\mathbf{q}_r$ by $\alpha$ before multiplying
the original quaternion,

$$ \mathbf{q}_2 = \mathbf{q}_r^\alpha \, \mathbf{q}_0, \tag{2.44} $$

while the second treats the quaternions as 4-vectors on a sphere and uses

$$ \mathbf{q}_2 = \frac{\sin (1 - \alpha)\theta}{\sin\theta} \, \mathbf{q}_0 + \frac{\sin \alpha\theta}{\sin\theta} \, \mathbf{q}_1, \tag{2.45} $$

where $\theta = \cos^{-1}(\mathbf{q}_0 \cdot \mathbf{q}_1)$ and the dot product is directly
between the quaternion 4-vectors. All of these formulas give comparable results, although care
should be taken when $\mathbf{q}_0$ and $\mathbf{q}_1$ are close together, which is why I
prefer to use an arctangent to establish the rotation angle.
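
The steps of Algorithm 2.1 can be sketched in Python with NumPy as follows. This is one possible reading of the algorithm rather than a reference implementation: the function names are assumptions, the inverse flips the sign of $\mathbf{v}$ (the other option flips $w$ and differs only by an overall sign), `arctan2` stands in for the arctangent of the ratio, and the early return for nearly identical rotations is an added guard that is not part of the algorithm as stated.

```python
import numpy as np

def quat_mul(q0, q1):
    """Quaternion product (2.42), components ordered (x, y, z, w)."""
    v0, w0, v1, w1 = q0[:3], q0[3], q1[:3], q1[3]
    return np.append(np.cross(v0, v1) + w0 * v1 + w1 * v0, w0 * w1 - np.dot(v0, v1))

def quat_inv(q):
    """Inverse of a unit quaternion: flip the sign of the vector part."""
    return np.append(-q[:3], q[3])

def quat_from_axis_angle(axis, theta):
    n = np.asarray(axis, dtype=float)
    n = n / np.linalg.norm(n)
    return np.append(np.sin(theta / 2.0) * n, np.cos(theta / 2.0))

def slerp(q0, q1, alpha):
    qr = quat_mul(q1, quat_inv(q0))              # step 1: q_r = q1 / q0
    if qr[3] < 0:                                # step 2: take the shorter arc
        qr = -qr
    vr, wr = qr[:3], qr[3]
    norm_vr = np.linalg.norm(vr)
    if norm_vr < 1e-12:                          # added guard: rotations coincide
        return np.array(q0, dtype=float)
    theta_r = 2.0 * np.arctan2(norm_vr, wr)      # step 3
    n_r = vr / norm_vr                           # step 4
    theta_a = alpha * theta_r                    # step 5
    q_a = np.append(np.sin(theta_a / 2.0) * n_r, np.cos(theta_a / 2.0))  # step 6
    return quat_mul(q_a, q0)                     # step 7

identity = np.array([0.0, 0.0, 0.0, 1.0])
qz90 = quat_from_axis_angle([0.0, 0.0, 1.0], np.pi / 2)
halfway = slerp(identity, qz90, 0.5)             # expected: 45 degrees about z
print(np.round(halfway, 6))
```

When many interpolated poses are needed, steps 1 through 4 depend only on the two endpoints and can be computed once outside the loop, as the caption of Algorithm 2.1 notes.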


Which rotation representation is better? 


The choice of representation for 3D rotations depends partly on the application.

The axis/angle representation is minimal, and hence does not require any additional
constraints on the parameters (no need to re-normalize after each update). If the angle is
expressed in degrees, it is easier to understand the pose (say, 90° twist around z-axis), and
also easier to express exact rotations. When the angle is in radians, the derivatives of
$\mathbf{R}$ with respect to $\boldsymbol{\omega}$ can easily be computed (2.36).

Quaternions, on the other hand, are better if you want to keep track of a smoothly moving 
camera, since there are no discontinuities in the representation. It is also easier to interpolate 
between rotations and to chain rigid transformations (Murray, Li, and Sastry 1994; Bregler 
and Malik 1998). 

My usual preference is to use quaternions, but to update their estimates using an incre- 


mental rotation, as described in Section 11.2.2. 


2.1.4 3D to 2D projections 


Now that we know how to represent 2D and 3D geometric primitives and how to transform 
them spatially, we need to specify how 3D primitives are projected onto the image plane. We 
can do this using a linear 3D to 2D projection matrix. The simplest model is orthography, 
which requires no division to get the final (inhomogeneous) result. The more commonly used 


model is perspective, since this more accurately models the behavior of real cameras. 


Orthography and para-perspective 


An orthographic projection simply drops the z component of the three-dimensional
coordinate $\mathbf{p}$ to obtain the 2D point $\mathbf{x}$. (In this section, we use
$\mathbf{p}$ to denote 3D points and $\mathbf{x}$ to denote 2D points.) This can be written as

$$ \mathbf{x} = [\mathbf{I}_{2 \times 2} \,|\, \mathbf{0}] \, \mathbf{p}. \tag{2.46} $$

If we are using homogeneous (projective) coordinates, we can write

$$ \tilde{\mathbf{x}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tilde{\mathbf{p}}, \tag{2.47} $$

i.e., we drop the z component but keep the w component. Orthography is an approximate
model for long focal length (telephoto) lenses and objects whose depth is shallow relative
to their distance to the camera (Sawhney and Hanson 1991). It is exact only for telecentric
lenses (Baker and Nayar 1999, 2001).

Figure 2.7 Commonly used projection models: (a) 3D view of world, (b) orthography, (c)
scaled orthography, (d) para-perspective, (e) perspective, (f) object-centered. Each diagram
shows a top-down view of the projection. Note how parallel lines on the ground plane and
box sides remain parallel in the non-perspective projections.

In practice, world coordinates (which may measure dimensions in meters) need to be
scaled to fit onto an image sensor (physically measured in millimeters, but ultimately
measured in pixels). For this reason, scaled orthography is actually more commonly used,

$$ \mathbf{x} = [s\mathbf{I}_{2 \times 2} \,|\, \mathbf{0}] \, \mathbf{p}. \tag{2.48} $$


This model is equivalent to first projecting the world points onto a local fronto-parallel image 
plane and then scaling this image using regular perspective projection. The scaling can be the 
same for all parts of the scene (Figure 2.7b) or it can be different for objects that are being 
modeled independently (Figure 2.7c). More importantly, the scaling can vary from frame to 
frame when estimating structure from motion, which can better model the scale change that 
occurs as an object approaches the camera. 

Scaled orthography is a popular model for reconstructing the 3D shape of objects far away 
from the camera, since it greatly simplifies certain computations. For example, pose (camera 
orientation) can be estimated using simple least squares (Section 11.2.1). Under orthography, 
structure and motion can simultaneously be estimated using factorization (singular value de- 
composition), as discussed in Section 11.4.1 (Tomasi and Kanade 1992). 

A closely related projection model is para-perspective (Aloimonos 1990; Poelman and 
Kanade 1997). In this model, object points are again first projected onto a local reference 
parallel to the image plane. However, rather than being projected orthogonally to this plane, 
they are projected parallel to the line of sight to the object center (Figure 2.7d). This is 
followed by the usual projection onto the final image plane, which again amounts to a scaling. 


The combination of these two projections is therefore affine and can be written as

$$ \tilde{\mathbf{x}} = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ 0 & 0 & 0 & 1 \end{bmatrix} \tilde{\mathbf{p}}. \tag{2.49} $$

Note how parallel lines in 3D remain parallel after projection in Figure 2.7b–d.
Para-perspective provides a more accurate projection model than scaled orthography, without
incurring the added complexity of per-pixel perspective division, which invalidates
traditional factorization methods (Poelman and Kanade 1997).


Perspective 


The most commonly used projection in computer graphics and computer vision is true 3D
perspective (Figure 2.7e). Here, points are projected onto the image plane by dividing them
by their z component. Using inhomogeneous coordinates, this can be written as

$$ \bar{\mathbf{x}} = \mathcal{P}_z(\mathbf{p}) = \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}. \tag{2.50} $$

In homogeneous coordinates, the projection has a simple linear form,

$$ \tilde{\mathbf{x}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{\mathbf{p}}, \tag{2.51} $$

i.e., we drop the w component of $\mathbf{p}$. Thus, after projection, it is not possible to
recover the distance of the 3D point from the image, which makes sense for a 2D imaging
sensor.
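
To make the difference between (2.47) and (2.51) concrete, the following Python/NumPy sketch applies both 3 × 4 matrices to the same homogeneous point and normalizes the result. The helper name `project` and the sample points are assumptions.

```python
import numpy as np

# Homogeneous projection matrices from (2.47) and (2.51).
P_ORTHO = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
P_PERSP = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])

def project(P, p):
    """Project a 3D point p with a 3x4 matrix and return the inhomogeneous 2D result."""
    x = P @ np.append(np.asarray(p, dtype=float), 1.0)
    return x[:2] / x[2]

p = np.array([2.0, 4.0, 8.0])
x_ortho = project(P_ORTHO, p)    # drops z
x_persp = project(P_PERSP, p)    # divides by z
print(x_ortho, x_persp)

# Two points along the same ray through the origin are indistinguishable after projection.
same_ray = project(P_PERSP, 3.0 * p)
```

Scaling the point along its viewing ray leaves the perspective projection unchanged, which is the loss of depth described above.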

A form often seen in computer graphics systems is a two-step projection that first projects
3D coordinates into normalized device coordinates
$(x, y, z) \in [-1, 1] \times [-1, 1] \times [0, 1]$, and then rescales these coordinates to
integer pixel coordinates using a viewport transformation (Watt 1995; OpenGL-ARB 1997). The
(initial) perspective projection is then represented using a 4 × 4 matrix

$$ \tilde{\mathbf{x}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -z_{\text{far}}/z_{\text{range}} & z_{\text{near}} z_{\text{far}}/z_{\text{range}} \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{\mathbf{p}}, \tag{2.52} $$

where $z_{\text{near}}$ and $z_{\text{far}}$ are the near and far z clipping planes and
$z_{\text{range}} = z_{\text{far}} - z_{\text{near}}$. Note that the first two rows are
actually scaled by the focal length and the aspect ratio so that visible rays are mapped to
$(x, y, z) \in [-1, 1]^2$. The reason for keeping the third row, rather than dropping it, is
that visibility operations, such as z-buffering, require a depth for every graphical element
that is being rendered.

If we set $z_{\text{near}} = 1$, $z_{\text{far}} \to \infty$, and switch the sign of the third
row, the third element of the normalized screen vector becomes the inverse depth, i.e., the
disparity (Okutomi and Kanade 1993). This can be quite convenient in many cases since, for
cameras moving around outdoors, the inverse depth to the camera is often a more
well-conditioned parameterization than direct 3D distance.

While a regular 2D image sensor has no way of measuring distance to a surface point,
range sensors (Section 13.2) and stereo matching algorithms (Chapter 12) can compute such
values. It is then convenient to be able to map from a sensor-based depth or disparity value
$d$ directly back to a 3D location using the inverse of a 4 × 4 matrix (Section 2.1.4). We can
do this if we represent perspective projection using a full-rank 4 × 4 matrix, as in (2.64).

Figure 2.8 Projection of a 3D camera-centered point $\mathbf{p}_c$ onto the sensor planes at
location $\mathbf{p}$. $\mathbf{O}_c$ is the optical center (nodal point), $\mathbf{c}_s$ is
the 3D origin of the sensor plane coordinate system, and $s_x$ and $s_y$ are the pixel
spacings.


Camera intrinsics 


Once we have projected a 3D point through an ideal pinhole using a projection matrix, we
must still transform the resulting coordinates according to the pixel sensor spacing and the
relative position of the sensor plane to the origin. Figure 2.8 shows an illustration of the
geometry involved. In this section, we first present a mapping from 2D pixel coordinates to
3D rays using a sensor homography $\mathbf{M}_s$, since this is easier to explain in terms of
physically measurable quantities. We then relate these quantities to the more commonly used
camera intrinsic matrix $\mathbf{K}$, which is used to map 3D camera-centered points
$\mathbf{p}_c$ to 2D pixel coordinates $\tilde{\mathbf{x}}_s$.

Image sensors return pixel values indexed by integer pixel coordinates $(x_s, y_s)$, often
with the coordinates starting at the upper-left corner of the image and moving down and to
the right. (This convention is not obeyed by all imaging libraries, but the adjustment for
other coordinate systems is straightforward.) To map pixel centers to 3D coordinates, we first
scale the $(x_s, y_s)$ values by the pixel spacings $(s_x, s_y)$ (sometimes expressed in
microns for solid-state sensors) and then describe the orientation of the sensor array
relative to the camera projection center $\mathbf{O}_c$ with an origin $\mathbf{c}_s$ and a
3D rotation $\mathbf{R}_s$ (Figure 2.8).


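The combined 2D to 3D projection can then be written as

$$ \mathbf{p} = [\mathbf{R}_s \,|\, \mathbf{c}_s] \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix} = \mathbf{M}_s \bar{\mathbf{x}}_s. \tag{2.53} $$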


The first two columns of the 3 × 3 matrix $\mathbf{M}_s$ are the 3D vectors corresponding to
unit steps in the image pixel array along the $x_s$ and $y_s$ directions, while the third
column is the 3D image array origin $\mathbf{c}_s$.

The matrix $\mathbf{M}_s$ is parameterized by eight unknowns: the three parameters describing
the rotation $\mathbf{R}_s$, the three parameters describing the translation $\mathbf{c}_s$,
and the two scale factors $(s_x, s_y)$. Note that we ignore here the possibility of skew
between the two axes on the image plane, since solid-state manufacturing techniques render
this negligible. In practice, unless we have accurate external knowledge of the sensor spacing
or sensor orientation, there are only seven degrees of freedom, since the distance of the
sensor from the origin cannot be teased apart from the sensor spacing, based on external image
measurement alone.

However, estimating a camera model $\mathbf{M}_s$ with the required seven degrees of freedom
(i.e., where the first two columns are orthogonal after an appropriate re-scaling) is
impractical, so most practitioners assume a general 3 × 3 homogeneous matrix form.

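The relationship between the 3D pixel center $\mathbf{p}$ and the 3D camera-centered point
$\mathbf{p}_c$ is given by an unknown scaling $s$, $\mathbf{p} = s\mathbf{p}_c$. We can
therefore write the complete projection between $\mathbf{p}_c$ and a homogeneous version of
the pixel address $\tilde{\mathbf{x}}_s$ as

$$ \tilde{\mathbf{x}}_s = \alpha \mathbf{M}_s^{-1} \mathbf{p}_c = \mathbf{K}\mathbf{p}_c. \tag{2.54} $$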


The 3 x 3 matrix K is called the calibration matrix and describes the camera intrinsics (as 
opposed to the camera’s orientation in space, which are called the extrinsics). 

From the above discussion, we see that $\mathbf{K}$ has seven degrees of freedom in theory and
eight degrees of freedom (the full dimensionality of a 3 × 3 homogeneous matrix) in practice.
Why, then, do most textbooks on 3D computer vision and multi-view geometry (Faugeras
1993; Hartley and Zisserman 2004; Faugeras and Luong 2001) treat $\mathbf{K}$ as an
upper-triangular matrix with five degrees of freedom?

While this is usually not made explicit in these books, it is because we cannot recover
the full $\mathbf{K}$ matrix based on external measurement alone. When calibrating a camera
(Section 11.1) based on external 3D points or other measurements (Tsai 1987), we end up
estimating the intrinsic ($\mathbf{K}$) and extrinsic ($\mathbf{R}$, $\mathbf{t}$) camera
parameters simultaneously using a series of measurements,

$$ \tilde{\mathbf{x}}_s = \mathbf{K} [\mathbf{R} \,|\, \mathbf{t}] \, \mathbf{p}_w = \mathbf{P}\mathbf{p}_w, \tag{2.55} $$

where $\mathbf{p}_w$ are known 3D world coordinates and

$$ \mathbf{P} = \mathbf{K}[\mathbf{R} \,|\, \mathbf{t}] \tag{2.56} $$

is known as the camera matrix. Inspecting this equation, we see that we can post-multiply
$\mathbf{K}$ by $\mathbf{R}_1$ and pre-multiply $[\mathbf{R} \,|\, \mathbf{t}]$ by
$\mathbf{R}_1^T$, and still end up with a valid calibration. Thus, it is impossible based on
image measurements alone to know the true orientation of the sensor and the true camera
intrinsics.

Figure 2.9 Simplified camera intrinsics showing the focal length $f$ and the image center
$(c_x, c_y)$. The image width and height are $W$ and $H$.

The choice of an upper-triangular form for $\mathbf{K}$ seems to be conventional. Given a full
3 × 4 camera matrix $\mathbf{P} = \mathbf{K}[\mathbf{R} \,|\, \mathbf{t}]$, we can compute an
upper-triangular $\mathbf{K}$ matrix using QR factorization (Golub and Van Loan 1996). (Note
the unfortunate clash of terminologies: In matrix algebra textbooks, $\mathbf{R}$ represents
an upper-triangular (right of the diagonal) matrix; in computer vision, $\mathbf{R}$ is an
orthogonal rotation.)
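
One way to carry out this factorization with a standard QR routine is to flip the left 3 × 3 block of the camera matrix, factor its transpose, and flip back, which produces an upper-triangular factor on the left. The Python/NumPy sketch below takes this approach. The function name `decompose_camera`, the sign conventions (positive diagonal for the calibration matrix, determinant +1 for the rotation), and the test camera are all assumptions made for illustration.

```python
import numpy as np

def decompose_camera(P):
    """Split a 3x4 camera matrix P ~ K [R | t] into upper-triangular K, rotation R, and t."""
    M = P[:, :3]
    F = np.fliplr(np.eye(3))              # permutation that reverses row/column order
    Q, U = np.linalg.qr((F @ M).T)        # (F M)^T = Q U, with U upper triangular
    K = F @ U.T @ F                       # upper triangular
    R = F @ Q.T                           # orthogonal, and K @ R == M
    D = np.diag(np.sign(np.diag(K)))      # force a positive diagonal on K
    K, R = K @ D, D @ R
    t = np.linalg.solve(K, P[:, 3])
    if np.linalg.det(R) < 0:              # P is only defined up to scale (and sign)
        R, t = -R, -t
    return K / K[2, 2], R, t

# Hypothetical ground-truth camera used to exercise the decomposition.
K0 = np.array([[800.0, 0.5, 320.0], [0.0, 820.0, 240.0], [0.0, 0.0, 1.0]])
a, b = 0.3, 0.2
Rz = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, np.cos(b), -np.sin(b)], [0.0, np.sin(b), np.cos(b)]])
R0 = Rz @ Rx
t0 = np.array([0.1, -0.2, 2.0])
P = K0 @ np.hstack([R0, t0[:, None]])

K, R, t = decompose_camera(P)
print(np.round(K, 3))
```

The recovered factors match the ones used to build the matrix only because a sign convention was imposed; as noted above, other splits between intrinsics and orientation explain the same image measurements equally well.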


There are several ways to write the upper-triangular form of $\mathbf{K}$. One possibility is

$$ \mathbf{K} = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{2.57} $$

which uses independent focal lengths $f_x$ and $f_y$ for the sensor x and y dimensions. The
entry $s$ encodes any possible skew between the sensor axes due to the sensor not being
mounted perpendicular to the optical axis and $(c_x, c_y)$ denotes the image center expressed
in pixel coordinates. The image center is also often called the principal point in the
computer vision literature (Hartley and Zisserman 2004), although in optics, the principal
points are 3D points usually inside the lens where the principal planes intersect the
principal (optical) axis (Hecht 2015). Another possibility is

$$ \mathbf{K} = \begin{bmatrix} f & s & c_x \\ 0 & af & c_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{2.58} $$

where the aspect ratio $a$ has been made explicit and a common focal length $f$ is used.


In practice, for many applications an even simpler form can be obtained by setting $a = 1$
and $s = 0$,

$$ \mathbf{K} = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}. \tag{2.59} $$

Often, setting the origin at roughly the center of the image, e.g.,
$(c_x, c_y) = (W/2, H/2)$, where $W$ and $H$ are the image width and height, respectively, can
result in a perfectly usable camera model with a single unknown, i.e., the focal length $f$.

Figure 2.10 Central projection, showing the relationship between the 3D and 2D coordinates,
$\mathbf{p}$ and $\mathbf{x}$, as well as the relationship between the focal length $f$, image
width $W$, and the horizontal field of view $\theta_H$.

Figure 2.9 shows how these quantities can be visualized as part of a simplified imaging 
model. Note that now we have placed the image plane in front of the nodal point (projection 
center of the lens). The sense of the y-axis has also been flipped to get a coordinate system 


compatible with the way that most imaging libraries treat the vertical (row) coordinate. 


A note on focal lengths 


The issue of how to express focal lengths is one that often causes confusion in implementing 
computer vision algorithms and discussing their results. This is because the focal length 
depends on the units used to measure pixels. 

If we number pixel coordinates using integer values, say $[0, W) \times [0, H)$, the focal
length $f$ and camera center $(c_x, c_y)$ in (2.59) can be expressed as pixel values. How do
these quantities relate to the more familiar focal lengths used by photographers?

Figure 2.10 illustrates the relationship between the focal length $f$, the sensor width $W$,
and the horizontal field of view $\theta_H$, which obey the formula

$$ \tan\frac{\theta_H}{2} = \frac{W}{2f} \quad \text{or} \quad f = \frac{W}{2}\left[\tan\frac{\theta_H}{2}\right]^{-1}. \tag{2.60} $$

For a traditional 35mm film camera, whose active exposure area is 24mm × 36mm, we have
$W$ = 36mm, and hence $f$ is also expressed in millimeters.⁴ For example, the “stock” lens
that often comes with SLR (single lens reflex) cameras is 50mm, which is a good length,
whereas 85mm is the standard for portrait photography. Since we work with digital images,
however, it is more convenient to express $W$ in pixels so that the focal length $f$ can be
used directly in the calibration matrix $\mathbf{K}$ as in (2.59).

Another possibility is to scale the pixel coordinates so that they go from $[-1, 1)$ along
the longer image dimension and $[-a^{-1}, a^{-1})$ along the shorter axis, where $a \geq 1$ is
the image aspect ratio (as opposed to the sensor cell aspect ratio introduced earlier). This
can be accomplished using modified normalized device coordinates,

$$ x_s' = (2x_s - W)/S \quad \text{and} \quad y_s' = (2y_s - H)/S, \quad \text{where} \quad S = \max(W, H). \tag{2.61} $$

This has the advantage that the focal length $f$ and image center $(c_x, c_y)$ become
independent of the image resolution, which can be useful when using multi-resolution,
image-processing algorithms, such as image pyramids (Section 3.5).⁵ The use of $S$ instead of
$W$ also makes the focal length the same for landscape (horizontal) and portrait (vertical)
pictures, as is the case in 35mm photography. (In some computer graphics textbooks and
systems, normalized device coordinates go from $[-1, 1] \times [-1, 1]$, which requires the
use of two different focal lengths to describe the camera intrinsics (Watt 1995).) Setting
$S = W = 2$ in (2.60), we obtain the simpler (unitless) relationship

$$ f^{-1} = \tan\frac{\theta_H}{2}. \tag{2.62} $$

The conversion between the various focal length representations is straightforward, e.g.,
to go from a unitless $f$ to one expressed in pixels, multiply by $W/2$, while to convert from
a unitless $f$ to the equivalent 35mm focal length, multiply by 18mm (equivalently, multiply
an $f$ expressed in pixels by 36mm$/W$).
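
These conversions follow from (2.60) and (2.62) and are short enough to write out. The sketch below is plain Python; the function names are assumptions, a landscape image (width at least as large as height) is assumed so that $S = W$, and the pixel-to-35mm conversion scales by the ratio of the 36mm frame width to the image width in pixels.

```python
import math

def focal_from_fov(width, fov_h):
    """Focal length in the same units as width, from the horizontal field of view (2.60)."""
    return (width / 2.0) / math.tan(fov_h / 2.0)

def fov_from_focal(width, f):
    """Horizontal field of view in radians, from a focal length in the units of width."""
    return 2.0 * math.atan(width / (2.0 * f))

def unitless_to_pixels(f_unitless, width_px):
    """Unitless focal length (2.62) to pixels, assuming a landscape image (S = W)."""
    return f_unitless * width_px / 2.0

def pixels_to_35mm(f_px, width_px):
    """Focal length in pixels to the 35mm-equivalent focal length in millimeters."""
    return f_px * 36.0 / width_px

# A 50mm lens on a 36mm-wide frame.
fov_50mm = fov_from_focal(36.0, 50.0)
f_px = focal_from_fov(720.0, fov_50mm)       # the same lens imaged at 720 pixels wide
f_unitless = 1.0 / math.tan(fov_50mm / 2.0)
print(math.degrees(fov_50mm), f_px, f_unitless)
```

A 50mm lens therefore corresponds to a horizontal field of view of roughly 40°, and the unitless focal length times 18mm recovers the 50mm figure.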


Camera matrix 


Now that we have shown how to parameterize the calibration matrix $\mathbf{K}$, we can put the
camera intrinsics and extrinsics together to obtain a single 3 × 4 camera matrix

$$ \mathbf{P} = \mathbf{K} [\mathbf{R} \,|\, \mathbf{t}]. \tag{2.63} $$

⁴ 35mm denotes the width of the film strip, of which 24mm is used for exposing each frame and
the remaining 11mm for perforation and frame numbering.

⁵ To make the conversion truly accurate after a downsampling step in a pyramid, floating point
values of $W$ and $H$ would have to be maintained, as they can become non-integer if they are
ever odd at a larger resolution in the pyramid.

Figure 2.11 Regular disparity (inverse depth) and projective depth (parallax from a
reference plane). Panel labels: d = inverse depth; d = projective depth.


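It is sometimes preferable to use an invertible 4 × 4 matrix, which can be obtained by not
dropping the last row in the $\mathbf{P}$ matrix,

$$ \tilde{\mathbf{P}} = \begin{bmatrix} \mathbf{K} & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix} \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix} = \tilde{\mathbf{K}}\mathbf{E}, \tag{2.64} $$

where $\mathbf{E}$ is a 3D rigid-body (Euclidean) transformation and $\tilde{\mathbf{K}}$ is
the full-rank calibration matrix. The 4 × 4 camera matrix $\tilde{\mathbf{P}}$ can be used to
map directly from 3D world coordinates $\bar{\mathbf{p}}_w = (x_w, y_w, z_w, 1)$ to screen
coordinates (plus disparity), $\mathbf{x}_s = (x_s, y_s, 1, d)$,

$$ \mathbf{x}_s \sim \tilde{\mathbf{P}} \bar{\mathbf{p}}_w, \tag{2.65} $$

where $\sim$ indicates equality up to scale. Note that after multiplication by
$\tilde{\mathbf{P}}$, the vector is divided by the third element of the vector to obtain the
normalized form $\mathbf{x}_s = (x_s, y_s, 1, d)$.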


Plane plus parallax (projective depth) 


In general, when using the $4 \times 4$ matrix $\tilde{P}$, we have the freedom to remap the last row to
whatever suits our purpose (rather than just being the “standard” interpretation of disparity as
inverse depth). Let us re-write the last row of $\tilde{P}$ as $p_3 = s_3[\hat{n}_0 | c_0]$, where
$\|\hat{n}_0\| = 1$. We then have the equation

$$d = \frac{s_3}{z}(\hat{n}_0 \cdot p_w + c_0), \tag{2.66}$$

where $z = p_2 \cdot \bar{p}_w = r_z \cdot (p_w - c)$ is the distance of $p_w$ from the camera center $C$ (2.25)
along the optical axis $Z$ (Figure 2.11). Thus, we can interpret $d$ as the projective disparity
or projective depth of a 3D scene point $p_w$ from the reference plane $\hat{n}_0 \cdot p_w + c_0 = 0$
(Szeliski and Coughlan 1997; Szeliski and Golland 1999; Shade, Gortler et al. 1998; Baker,
Szeliski, and Anandan 1998). (The projective depth is also sometimes called parallax in
reconstruction algorithms that use the term plane plus parallax (Kumar, Anandan, and Hanna
1994; Sawhney 1994).) Setting $\hat{n}_0 = 0$ and $c_0 = 1$, i.e., putting the reference plane at infinity,
results in the more standard $d = 1/z$ version of disparity (Okutomi and Kanade 1993).
Another way to see this is to invert the $\tilde{P}$ matrix so that we can map pixels plus disparity
directly back to 3D points,

$$\tilde{p}_w = \tilde{P}^{-1} x_s. \tag{2.67}$$
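The projective depth of (2.66) is easy to evaluate directly. The sketch below assumes $s_3 = 1$ by default and a hypothetical camera at the origin looking along the $z$ axis; the function name `projective_depth` and the test points are illustrative only.

```python
def projective_depth(p_w, n0, c0, r_z, c, s3=1.0):
    """Evaluate d = (s3 / z) * (n0 . p_w + c0), with z = r_z . (p_w - c), as in (2.66)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    z = dot(r_z, [pi - ci for pi, ci in zip(p_w, c)])
    return s3 / z * (dot(n0, p_w) + c0)


optical_axis = [0.0, 0.0, 1.0]   # r_z for an unrotated camera (assumption)
center = [0.0, 0.0, 0.0]         # camera center c (assumption)

# Reference plane at infinity: n0 = 0, c0 = 1 gives ordinary inverse depth.
d_inverse = projective_depth([1.0, 2.0, 4.0], [0.0, 0.0, 0.0], 1.0, optical_axis, center)

# Reference plane z = 4: points on the plane have zero projective depth.
d_on_plane = projective_depth([1.0, 2.0, 4.0], [0.0, 0.0, 1.0], -4.0, optical_axis, center)
d_in_front = projective_depth([1.0, 2.0, 2.0], [0.0, 0.0, 1.0], -4.0, optical_axis, center)
```

Points on the reference plane get $d = 0$, while points off the plane get a signed parallax that grows as they approach the camera.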


In general, we can choose P to have whatever form is convenient, i.e., to sample space us- 
ing an arbitrary projection. This can come in particularly handy when setting up multi-view 
stereo reconstruction algorithms, since it allows us to sweep a series of planes (Section 12.1.2) 
through space with a variable (projective) sampling that best matches the sensed image mo- 
tions (Collins 1996; Szeliski and Golland 1999; Saito and Kanade 1999). 


Mapping from one camera to another 


What happens when we take two images of a 3D scene from different camera positions or
orientations (Figure 2.12a)? Using the full-rank $4 \times 4$ camera matrix $\tilde{P} = \tilde{K} E$ from (2.64),
we can write the projection from world to screen coordinates as

$$\tilde{x}_0 \sim \tilde{K}_0 E_0 p = \tilde{P}_0 p. \tag{2.68}$$

Assuming that we know the z-buffer or disparity value $d_0$ for a pixel in one image, we can
compute the 3D point location $p$ using

$$p \sim E_0^{-1} \tilde{K}_0^{-1} \tilde{x}_0 \tag{2.69}$$

and then project it into another image yielding

$$\tilde{x}_1 \sim \tilde{K}_1 E_1 p = \tilde{K}_1 E_1 E_0^{-1} \tilde{K}_0^{-1} \tilde{x}_0 = \tilde{P}_1 \tilde{P}_0^{-1} \tilde{x}_0 = M_{10} \tilde{x}_0. \tag{2.70}$$

Unfortunately, we do not usually have access to the depth coordinates of pixels in a regular
photographic image. However, for a planar scene, as discussed above in (2.66), we can
replace the last row of $\tilde{P}_0$ in (2.64) with a general plane equation, $\hat{n}_0 \cdot p + c_0$, that maps
points on the plane to $d_0 = 0$ values (Figure 2.12b). Thus, if we set $d_0 = 0$, we can ignore
the last column of $M_{10}$ in (2.70) and also its last row, since we do not care about the final
z-buffer depth. The mapping Equation (2.70) thus reduces to

$$\tilde{x}_1 \sim \tilde{H}_{10} \tilde{x}_0, \tag{2.71}$$


62 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


Figure 2.12 A point is projected into two images: (a) relationship between the 3D point
coordinate $(X, Y, Z, 1)$ and the 2D projected point $(x, y, 1, d)$; (b) planar homography induced
by points all lying on a common plane $\hat{n}_0 \cdot p + c_0 = 0$.


where $\tilde{H}_{10}$ is a general $3 \times 3$ homography matrix and $\tilde{x}_1$ and $\tilde{x}_0$ are now 2D homogeneous
coordinates (i.e., 3-vectors) (Szeliski 1996). This justifies the use of the 8-parameter
homography as a general alignment model for mosaics of planar scenes (Mann and Picard 1994;
Szeliski 1996).

The other special case where we do not need to know depth to perform inter-camera
mapping is when the camera is undergoing pure rotation (Section 8.2.3), i.e., when $t_0 = t_1$.
In this case, we can write

$$\tilde{x}_1 \sim K_1 R_1 R_0^{-1} K_0^{-1} \tilde{x}_0 = K_1 R_{10} K_0^{-1} \tilde{x}_0, \tag{2.72}$$


which again can be represented with a 3 x 3 homography. If we assume that the calibration 
matrices have known aspect ratios and centers of projection (2.59), this homography can be 
parameterized by the rotation amount and the two unknown focal lengths. This particular 


formulation is commonly used in image-stitching applications (Section 8.2.3). 
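The rotational homography of (2.72) can be sketched in a few lines for the simple calibration matrix of (2.59). The code below assumes a pure pan about the camera's $y$ axis, a shared image center, and a particular sign convention for the rotation; the function names and the numbers are illustrative assumptions.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rotation_homography(f0, f1, cx, cy, angle_y):
    """H_10 = K_1 R_10 K_0^{-1} for a pan of angle_y radians about the y axis (assumed convention)."""
    K1 = [[f1, 0.0, cx], [0.0, f1, cy], [0.0, 0.0, 1.0]]
    # Closed-form inverse of K_0 = [[f0, 0, cx], [0, f0, cy], [0, 0, 1]].
    K0_inv = [[1.0 / f0, 0.0, -cx / f0], [0.0, 1.0 / f0, -cy / f0], [0.0, 0.0, 1.0]]
    c, s = math.cos(angle_y), math.sin(angle_y)
    R10 = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    return matmul3(K1, matmul3(R10, K0_inv))


def apply_homography(H, x, y):
    """Map a pixel through H and divide by the third homogeneous coordinate."""
    u, v, w = (H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3))
    return u / w, v / w


H = rotation_homography(500.0, 600.0, 320.0, 240.0, 0.1)
x1, y1 = apply_homography(H, 320.0, 240.0)   # where the old image center lands
```

In this sketch the only unknowns are the rotation angle and the two focal lengths, which mirrors the parameterization described above.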


Object-centered projection 


When working with long focal length lenses, it often becomes difficult to reliably estimate 
the focal length from image measurements alone. This is because the focal length and the 
distance to the object are highly correlated and it becomes difficult to tease these two effects 
apart. For example, the change in scale of an object viewed through a zoom telephoto lens 
can either be due to a zoom change or to a motion towards the user. (This effect was put 
to dramatic use in some scenes of Alfred Hitchcock’s film Vertigo, where the simultaneous 
change of zoom and camera motion produces a disquieting effect.) 


This ambiguity becomes clearer if we write out the projection equation corresponding to
the simple calibration matrix $K$ (2.59),

$$x_s = f \frac{r_x \cdot p + t_x}{r_z \cdot p + t_z} + c_x \tag{2.73}$$

$$y_s = f \frac{r_y \cdot p + t_y}{r_z \cdot p + t_z} + c_y, \tag{2.74}$$

where $r_x$, $r_y$, and $r_z$ are the three rows of $R$. If the distance to the object center $t_z \gg \|p\|$ (the
size of the object), the denominator is approximately $t_z$ and the overall scale of the projected
object depends on the ratio of $f$ to $t_z$. It therefore becomes difficult to disentangle these two
quantities.

To see this more clearly, let $\eta_z = t_z^{-1}$ and $s = \eta_z f$. We can then re-write the above
equations as

$$x_s = s \frac{r_x \cdot p + t_x}{1 + \eta_z r_z \cdot p} + c_x \tag{2.75}$$

$$y_s = s \frac{r_y \cdot p + t_y}{1 + \eta_z r_z \cdot p} + c_y \tag{2.76}$$

(Szeliski and Kang 1994; Pighin, Hecker et al. 1998). The scale of the projection $s$ can
be reliably estimated if we are looking at a known object (i.e., the 3D coordinates $p$ are
known). The inverse distance $\eta_z$ is now mostly decoupled from the estimates of $s$ and can
be estimated from the amount of foreshortening as the object rotates. Furthermore, as the
lens becomes longer, i.e., the projection model becomes orthographic, there is no need to
replace a perspective imaging model with an orthographic one, since the same equation can
be used, with $\eta_z \to 0$ (as opposed to $f$ and $t_z$ both going to infinity). This allows us to form
a natural link between orthographic reconstruction techniques such as factorization and their
projective/perspective counterparts (Section 11.4.1).
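The equivalence of the two parameterizations, and the orthographic limit, can be checked numerically. This is a small sketch with hypothetical values for $f$, $t$, and $p$; the function names are not from the text.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_standard(p, R, t, f, cx, cy):
    """Perspective projection in the form of (2.73-2.74)."""
    z = dot(R[2], p) + t[2]
    return (f * (dot(R[0], p) + t[0]) / z + cx,
            f * (dot(R[1], p) + t[1]) / z + cy)


def project_object_centered(p, R, tx, ty, s, eta_z, cx, cy):
    """Object-centered form of (2.75-2.76); eta_z = 0 gives scaled orthography."""
    w = 1.0 + eta_z * dot(R[2], p)
    return (s * (dot(R[0], p) + tx) / w + cx,
            s * (dot(R[1], p) + ty) / w + cy)


R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = [0.3, 0.4, 0.5]                 # point on a small object (assumed units)
f, t = 1000.0, [0.1, -0.2, 50.0]    # long focal length, distant object
eta_z = 1.0 / t[2]
s = eta_z * f

a = project_standard(p, R, t, f, 320.0, 240.0)
b = project_object_centered(p, R, t[0], t[1], s, eta_z, 320.0, 240.0)
ortho = project_object_centered(p, R, t[0], t[1], s, 0.0, 320.0, 240.0)
```

Here `a` and `b` agree to floating-point precision, while `ortho` shows the same formula evaluated with $\eta_z = 0$, i.e., a scaled orthographic projection.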


2.1.5 Lens distortions 


The above imaging models all assume that cameras obey a linear projection model where 
straight lines in the world result in straight lines in the image. (This follows as a natural 
consequence of linear matrix operations being applied to homogeneous coordinates.) Unfor- 
tunately, many wide-angle lenses have noticeable radial distortion, which manifests itself as 
a visible curvature in the projection of straight lines. (See Section 2.2.3 for a more detailed 
discussion of lens optics, including chromatic aberration.) Unless this distortion is taken into 
account, it becomes impossible to create highly accurate photorealistic reconstructions. For 
example, image mosaics constructed without taking radial distortion into account will often 
exhibit blurring due to the misregistration of corresponding features before pixel blending 
(Section 8.2). 
Fortunately, compensating for radial distortion is not that difficult in practice. For most
lenses, a simple quartic model of distortion can produce good results. Let $(x_c, y_c)$ be the
pixel coordinates obtained after perspective division but before scaling by focal length $f$ and
shifting by the image center $(c_x, c_y)$, i.e.,

$$x_c = \frac{r_x \cdot p + t_x}{r_z \cdot p + t_z}, \qquad y_c = \frac{r_y \cdot p + t_y}{r_z \cdot p + t_z}. \tag{2.77}$$

The radial distortion model says that coordinates in the observed images are displaced
towards (barrel distortion) or away (pincushion distortion) from the image center by an amount
proportional to their radial distance (Figure 2.13a-b).⁶ The simplest radial distortion models
use low-order polynomials, e.g.,

$$\hat{x}_c = x_c(1 + \kappa_1 r_c^2 + \kappa_2 r_c^4), \qquad \hat{y}_c = y_c(1 + \kappa_1 r_c^2 + \kappa_2 r_c^4), \tag{2.78}$$

where $r_c^2 = x_c^2 + y_c^2$ and $\kappa_1$ and $\kappa_2$ are called the radial distortion parameters.⁷ This model,
which also includes a tangential component to account for lens decentering, was first
proposed in the photogrammetry literature by Brown (1966), and so is sometimes called the
Brown or Brown-Conrady model. However, the tangential components of the distortion are
usually ignored because they can lead to less stable estimates (Zhang 2000).

After the radial distortion step, the final pixel coordinates can be computed using

$$x_s = f \hat{x}_c + c_x, \qquad y_s = f \hat{y}_c + c_y. \tag{2.79}$$

A variety of techniques can be used to estimate the radial distortion parameters for a given
lens, as discussed in Section 11.1.4.
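The chain from normalized coordinates to distorted pixels in (2.78) and (2.79) can be sketched as follows. The parameter values are made up for illustration, and the function name `distort_and_project` is an assumption.

```python
def distort_and_project(xc, yc, kappa1, kappa2, f, cx, cy):
    """Apply the polynomial radial model (2.78), then scale and shift as in (2.79)."""
    r2 = xc * xc + yc * yc                        # r_c^2
    scale = 1.0 + kappa1 * r2 + kappa2 * r2 * r2  # 1 + kappa1 r^2 + kappa2 r^4
    return f * xc * scale + cx, f * yc * scale + cy


# Negative kappa1 pulls points towards the center (barrel distortion).
x_s, y_s = distort_and_project(0.5, 0.0, -0.2, 0.0, 500.0, 320.0, 240.0)
x_ideal, y_ideal = distort_and_project(0.5, 0.0, 0.0, 0.0, 500.0, 320.0, 240.0)
```

Comparing `x_s` with `x_ideal` shows the displacement that the distortion introduces for a point halfway to the edge of a moderately wide field of view.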

Sometimes the above simplified model does not model the true distortions produced by 
complex lenses accurately enough (especially at very wide angles). A more complete analytic 
model also includes tangential distortions and decentering distortions (Slama 1980). 

Fisheye lenses (Figure 2.13c) require a model that differs from traditional polynomial 
models of radial distortion. Fisheye lenses behave, to a first approximation, as equi-distance 


6 Anamorphic lenses, which are widely used in feature film production, do not follow this radial distortion model. 
Instead, they can be thought of, to a first approximation, as inducing different vertical and horizontal scaling, i.e., 
non-square pixels. 

7 Sometimes the relationship between $x_c$ and $\hat{x}_c$ is expressed the other way around, i.e.,
$x_c = \hat{x}_c(1 + \kappa_1 \hat{r}_c^2 + \kappa_2 \hat{r}_c^4)$. This is convenient if we map image pixels into (warped) rays by
dividing through by $f$. We can then undistort the rays and have true 3D rays in space.


2.1 Geometric primitives and transformations 65 




Figure 2.13 Radial lens distortions: (a) barrel, (b) pincushion, and (c) fisheye. The fisheye 
image spans almost 180° from side-to-side. 


projectors of angles away from the optical axis (Xiong and Turkowski 1997),

$$r = f\theta, \tag{2.80}$$

which is the same as the polar projection described by Equations (8.55-8.57). Because of
the mostly linear mapping between distance from the center (pixels) and viewing angle, such
lenses are sometimes called f-theta lenses, which is likely where the popular RICOH THETA
360° camera got its name. Xiong and Turkowski (1997) describe how this model can be
extended with the addition of an extra quadratic correction in $\phi$ and how the unknown
parameters (center of projection, scaling factor $s$, etc.) can be estimated from a set of overlapping
fisheye images using a direct (intensity-based) non-linear minimization algorithm.
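A minimal sketch of the equi-distance model (2.80) is given below, next to the perspective radius $f \tan\theta$ for comparison. It assumes the point is given in camera coordinates with the optical axis along $z$; the function names and values are illustrative.

```python
import math


def fisheye_project(p, f, cx, cy):
    """Equi-distance projection r = f * theta for p = (X, Y, Z) in camera coordinates."""
    X, Y, Z = p
    theta = math.atan2(math.hypot(X, Y), Z)   # angle away from the optical axis
    r = f * theta                              # (2.80)
    azimuth = math.atan2(Y, X)                 # direction in the image plane
    return cx + r * math.cos(azimuth), cy + r * math.sin(azimuth)


def perspective_radius(theta, f):
    """Radius of the same ray under ordinary perspective projection."""
    return f * math.tan(theta)


# A ray 90 degrees off axis still lands at a finite radius f * pi / 2.
x_side, y_side = fisheye_project([1.0, 0.0, 0.0], 300.0, 640.0, 480.0)
x_axis, y_axis = fisheye_project([0.0, 0.0, 1.0], 300.0, 640.0, 480.0)
```

Under perspective projection the same 90° ray would map to infinity, which is why a fisheye image can span almost 180° from side to side.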

For even larger, less regular distortions, a parametric distortion model using splines may 
be necessary (Goshtasby 1989). If the lens does not have a single center of projection, it 
may become necessary to model the 3D line (as opposed to direction) corresponding to each 
pixel separately (Gremban, Thorpe, and Kanade 1988; Champleboux, Lavallée et al. 1992a; 
Grossberg and Nayar 2001; Sturm and Ramalingam 2004; Tardif, Sturm et al. 2009). Some 
of these techniques are described in more detail in Section 11.1.4, which discusses how to 
calibrate lens distortions. 

There is one subtle issue associated with the simple radial distortion model that is often 
glossed over. We have introduced a non-linearity between the perspective projection and final 
sensor array projection steps. Therefore, we cannot, in general, post-multiply an arbitrary 3 x 
3 matrix K with a rotation to put it into upper-triangular form and absorb this into the global 
rotation. However, this situation is not as bad as it may at first appear. For many applications, 
keeping the simplified diagonal form of (2.59) is still an adequate model. Furthermore, if we 
correct radial and other distortions to an accuracy where straight lines are preserved, we have 




Figure 2.14 A simplified model of photometric image formation. Light is emitted by one 
or more light sources and is then reflected from an object's surface. A portion of this light is 
directed towards the camera. This simplified model ignores multiple reflections, which often 


occur in real-world scenes. 


essentially converted the sensor back into a linear imager and the previous decomposition still 
applies. 


2.2 Photometric image formation 


In modeling the image formation process, we have described how 3D geometric features in 
the world are projected into 2D features in an image. However, images are not composed of 
2D features. Instead, they are made up of discrete color or intensity values. Where do these 
values come from? How do they relate to the lighting in the environment, surface properties 
and geometry, camera optics, and sensor properties (Figure 2.14)? In this section, we develop 
a set of models to describe these interactions and formulate a generative process of image 
formation. A more detailed treatment of these topics can be found in textbooks on computer 
graphics and image synthesis (Cohen and Wallace 1993; Sillion and Puech 1994; Watt 1995; 
Glassner 1995; Weyrich, Lawrence et al. 2009; Hughes, van Dam et al. 2013; Marschner and 
Shirley 2015). 


2.2.1 Lighting 


Images cannot exist without light. To produce an image, the scene must be illuminated with 
one or more light sources. (Certain modalities such as fluorescence microscopy and X-ray 
tomography do not fit this model, but we do not deal with them in this book.) Light sources 


can generally be divided into point and area light sources. 
A point light source originates at a single location in space (e.g., a small light bulb),
potentially at infinity (e.g., the Sun). (Note that for some applications such as modeling soft
shadows (penumbras), the Sun may have to be treated as an area light source.) In addition to
its location, a point light source has an intensity and a color spectrum, i.e., a distribution over
wavelengths $L(\lambda)$. The intensity of a light source falls off with the square of the distance
between the source and the object being lit, because the same light is being spread over a
larger (spherical) area. A light source may also have a directional falloff (dependence), but
we ignore this in our simplified model.

Area light sources are more complicated. A simple area light source such as a fluorescent 
ceiling light fixture with a diffuser can be modeled as a finite rectangular area emitting light 
equally in all directions (Cohen and Wallace 1993; Sillion and Puech 1994; Glassner 1995). 
When the distribution is strongly directional, a four-dimensional lightfield can be used instead 
(Ashdown 1993). 

A more complex light distribution that approximates, say, the incident illumination on an
object sitting in an outdoor courtyard, can often be represented using an environment map
(Greene 1986) (originally called a reflection map (Blinn and Newell 1976)). This representation
maps incident light directions $\hat{v}$ to color values (or wavelengths, $\lambda$),

$$L(\hat{v}; \lambda), \tag{2.81}$$


and is equivalent to assuming that all light sources are at infinity. Environment maps can be 
represented as a collection of cubical faces (Greene 1986), as a single longitude—latitude map 
(Blinn and Newell 1976), or as the image of a reflecting sphere (Watt 1995). A convenient 
way to get a rough model of a real-world environment map is to take an image of a reflective 
mirrored sphere (sometimes accompanied by a darker sphere to capture highlights) and to 
unwrap this image onto the desired environment map (Debevec 1998). Watt (1995) gives a 
nice discussion of environment mapping, including the formulas needed to map directions to 


pixels for the three most commonly used representations. 


2.2.2 Reflectance and shading 


When light hits an object’s surface, it is scattered and reflected (Figure 2.15a). Many different 
models have been developed to describe this interaction. In this section, we first describe the 
most general form, the bidirectional reflectance distribution function, and then look at some 
more specialized models, including the diffuse, specular, and Phong shading models. We also 
discuss how these models can be used to compute the global illumination corresponding to a 


scene. 
Figure 2.15 (a) Light scatters when it hits a surface. (b) The bidirectional reflectance
distribution function (BRDF) $f(\theta_i, \phi_i, \theta_r, \phi_r)$ is parameterized by the angles that the
incident, $\hat{v}_i$, and reflected, $\hat{v}_r$, light ray directions make with the local surface coordinate frame
$(\hat{d}_x, \hat{d}_y, \hat{n})$.


The Bidirectional Reflectance Distribution Function (BRDF) 


The most general model of light scattering is the bidirectional reflectance distribution
function (BRDF).⁸ Relative to some local coordinate frame on the surface, the BRDF is a
four-dimensional function that describes how much of each wavelength arriving at an incident
direction $\hat{v}_i$ is emitted in a reflected direction $\hat{v}_r$ (Figure 2.15b). The function can be written
in terms of the angles of the incident and reflected directions relative to the surface frame as

$$f_r(\theta_i, \phi_i, \theta_r, \phi_r; \lambda). \tag{2.82}$$

The BRDF is reciprocal, i.e., because of the physics of light transport, you can interchange
the roles of $\hat{v}_i$ and $\hat{v}_r$ and still get the same answer (this is sometimes called Helmholtz
reciprocity).

Most surfaces are isotropic, i.e., there are no preferred directions on the surface as far
as light transport is concerned. (The exceptions are anisotropic surfaces such as brushed
(scratched) aluminum, where the reflectance depends on the light orientation relative to the
direction of the scratches.) For an isotropic material, we can simplify the BRDF to

$$f_r(\theta_i, \theta_r, |\phi_r - \phi_i|; \lambda) \quad \text{or} \quad f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda), \tag{2.83}$$

as the quantities $\theta_i$, $\theta_r$, and $\phi_r - \phi_i$ can be computed from the directions $\hat{v}_i$, $\hat{v}_r$, and $\hat{n}$.


8 Actually, even more general models of light transport exist, including some that model spatial variation along 
the surface, sub-surface scattering, and atmospheric effects—see Section 13.7.1—(Dorsey, Rushmeier, and Sillion 
2007; Weyrich, Lawrence et al. 2009). 
Figure 2.16 This close-up of a statue shows both diffuse (smooth shading) and specular
(shiny highlight) reflection, as well as darkening in the grooves and creases due to reduced
light visibility and interreflections. (Photo courtesy of the Caltech Vision Lab,
http://www.vision.caltech.edu/archive.html.)


To calculate the amount of light exiting a surface point $p$ in a direction $\hat{v}_r$ under a given
lighting condition, we integrate the product of the incoming light $L_i(\hat{v}_i; \lambda)$ with the BRDF
(some authors call this step a convolution). Taking into account the foreshortening factor
$\cos^+ \theta_i$, we obtain

$$L_r(\hat{v}_r; \lambda) = \int L_i(\hat{v}_i; \lambda) f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) \cos^+ \theta_i \, d\hat{v}_i, \tag{2.84}$$

where

$$\cos^+ \theta_i = \max(0, \cos \theta_i). \tag{2.85}$$

If the light sources are discrete (a finite number of point light sources), we can replace the
integral with a summation,

$$L_r(\hat{v}_r; \lambda) = \sum_i L_i(\lambda) f_r(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) \cos^+ \theta_i. \tag{2.86}$$


BRDFs for a given surface can be obtained through physical modeling (Torrance and 
Sparrow 1967; Cook and Torrance 1982; Glassner 1995), heuristic modeling (Phong 1975; 
Lafortune, Foo et al. 1997), or through empirical observation (Ward 1992; Westin, Arvo, and 
Torrance 1992; Dana, van Ginneken et al. 1999; Marschner, Westin et al. 2000; Matusik, 
Pfister et al. 2003; Dorsey, Rushmeier, and Sillion 2007; Weyrich, Lawrence et al. 2009; 
Shi, Mo et al. 2019).⁹ Typical BRDFs can often be split into their diffuse and specular
components, as described below.


9 See http://www1.cs.columbia.edu/CAVE/software/curet for a database of some empirically sampled BRDFs.
Figure 2.17 (a) The diminution of returned light caused by foreshortening depends on
$\hat{v}_i \cdot \hat{n}$, the cosine of the angle between the incident light direction $\hat{v}_i$ and the surface normal
$\hat{n}$. (b) Mirror (specular) reflection: The incident light ray direction $\hat{v}_i$ is reflected onto the
specular direction $\hat{s}_i$ around the surface normal $\hat{n}$.


Diffuse reflection 


The diffuse component (also known as Lambertian or matte reflection) scatters light uni- 
formly in all directions and is the phenomenon we most normally associate with shading, 
e.g., the smooth (non-shiny) variation of intensity with surface normal that is seen when ob- 
serving a statue (Figure 2.16). Diffuse reflection also often imparts a strong body color to 
the light, as it is caused by selective absorption and re-emission of light inside the object’s 
material (Shafer 1985; Glassner 1995). 


While light is scattered uniformly in all directions, i.e., the BRDF is constant,

$$f_d(\hat{v}_i, \hat{v}_r, \hat{n}; \lambda) = f_d(\lambda), \tag{2.87}$$

the amount of light depends on the angle between the incident light direction and the surface
normal $\theta_i$. This is because the surface area exposed to a given amount of light becomes larger
at oblique angles, becoming completely self-shadowed as the outgoing surface normal points
away from the light (Figure 2.17a). (Think about how you orient yourself towards the Sun or
fireplace to get maximum warmth and how a flashlight projected obliquely against a wall is
less bright than one pointing directly at it.) The shading equation for diffuse reflection can
thus be written as

$$L_d(\hat{v}_r; \lambda) = \sum_i L_i(\lambda) f_d(\lambda) \cos^+ \theta_i = \sum_i L_i(\lambda) f_d(\lambda) [\hat{v}_i \cdot \hat{n}]^+, \tag{2.88}$$

where

$$[\hat{v}_i \cdot \hat{n}]^+ = \max(0, \hat{v}_i \cdot \hat{n}). \tag{2.89}$$


Specular reflection


The second major component of a typical BRDF is specular (gloss or highlight) reflection,
which depends strongly on the direction of the outgoing light. Consider light reflecting off a
mirrored surface (Figure 2.17b). Incident light rays are reflected in a direction that is rotated
by 180° around the surface normal $\hat{n}$. Using the same notation as in Equations (2.29-2.30),
we can compute the specular reflection direction $\hat{s}_i$ as

$$\hat{s}_i = v_\parallel - v_\perp = (2\hat{n}\hat{n}^T - I) v_i. \tag{2.90}$$

The amount of light reflected in a given direction $\hat{v}_r$ thus depends on the angle
$\theta_s = \cos^{-1}(\hat{v}_r \cdot \hat{s}_i)$ between the view direction $\hat{v}_r$ and the specular direction $\hat{s}_i$. For example, the
Phong (1975) model uses a power of the cosine of the angle,

$$f_s(\theta_s; \lambda) = k_s(\lambda) \cos^{k_e} \theta_s, \tag{2.91}$$

while the Torrance and Sparrow (1967) micro-facet model uses a Gaussian,

$$f_s(\theta_s; \lambda) = k_s(\lambda) \exp(-c_s^2 \theta_s^2). \tag{2.92}$$

Larger exponents $k_e$ (or inverse Gaussian widths $c_s$) correspond to more specular surfaces
with distinct highlights, while smaller exponents better model materials with softer gloss.
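The mirror direction of (2.90) and the two specular lobes of (2.91) and (2.92) can be sketched as below, evaluated at a single wavelength. Clamping the cosine at zero in the Phong lobe is an assumption added here to avoid negative values behind the surface; the function names are illustrative.

```python
import math


def reflect(v, n):
    """Specular direction s = (2 n n^T - I) v for a unit normal n, as in (2.90)."""
    d = sum(a * b for a, b in zip(v, n))
    return [2.0 * d * ni - vi for vi, ni in zip(v, n)]


def phong_lobe(theta_s, k_s, k_e):
    """Cosine-power lobe (2.91); the clamp at zero is an added assumption."""
    return k_s * max(0.0, math.cos(theta_s)) ** k_e


def gaussian_lobe(theta_s, k_s, c_s):
    """Gaussian lobe (2.92) with inverse width c_s."""
    return k_s * math.exp(-(c_s ** 2) * theta_s ** 2)


normal = [0.0, 0.0, 1.0]
incident = [0.5, 0.0, math.sqrt(3.0) / 2.0]   # 30 degrees from the normal
mirror = reflect(incident, normal)

sharp = phong_lobe(math.radians(10.0), 1.0, 100.0)
soft = phong_lobe(math.radians(10.0), 1.0, 5.0)
```

Ten degrees away from the mirror direction, the lobe with the larger exponent has already fallen off much further than the one with the smaller exponent.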


Phong shading 


Phong (1975) combined the diffuse and specular components of reflection with another term,
which he called the ambient illumination. This term accounts for the fact that objects are
generally illuminated not only by point light sources but also by a general diffuse illumination
corresponding to inter-reflection (e.g., the walls in a room) or distant sources, such as the
blue sky. In the Phong model, the ambient term does not depend on surface orientation, but
depends on the color of both the ambient illumination $L_a(\lambda)$ and the object $k_a(\lambda)$,

$$f_a(\lambda) = k_a(\lambda) L_a(\lambda). \tag{2.93}$$

Putting all of these terms together, we arrive at the Phong shading model,

$$L_r(\hat{v}_r; \lambda) = k_a(\lambda) L_a(\lambda) + k_d(\lambda) \sum_i L_i(\lambda) [\hat{v}_i \cdot \hat{n}]^+ + k_s(\lambda) \sum_i L_i(\lambda) (\hat{v}_r \cdot \hat{s}_i)^{k_e}. \tag{2.94}$$
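As a rough sketch, the Phong model (2.94) can be evaluated at a single wavelength as follows. The function name, the representation of lights as (direction, intensity) pairs, and the clamping of the specular dot product at zero are assumptions made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phong_radiance(n, v_r, lights, k_a, k_d, k_s, k_e, L_a):
    """Evaluate (2.94) at one wavelength.

    lights is a list of (v_i, L_i) pairs, where v_i is a unit vector pointing
    from the surface towards the light and L_i is its intensity.
    """
    total = k_a * L_a                               # ambient term
    for v_i, L_i in lights:
        n_dot_l = max(0.0, dot(v_i, n))             # [v_i . n]^+
        s_i = [2.0 * dot(v_i, n) * nc - vc for vc, nc in zip(v_i, n)]
        # Assumption: no highlight from lights behind the surface.
        spec = max(0.0, dot(v_r, s_i)) ** k_e if n_dot_l > 0.0 else 0.0
        total += L_i * (k_d * n_dot_l + k_s * spec)
    return total


normal = [0.0, 0.0, 1.0]
light_dir = [0.5, 0.0, math.sqrt(3.0) / 2.0]        # light 30 degrees off the normal
view_mirror = [-0.5, 0.0, math.sqrt(3.0) / 2.0]     # viewer along the mirror direction
view_normal = [0.0, 0.0, 1.0]                       # viewer along the normal

highlight = phong_radiance(normal, view_mirror, [(light_dir, 1.0)], 0.1, 0.6, 0.3, 10.0, 1.0)
off_peak = phong_radiance(normal, view_normal, [(light_dir, 1.0)], 0.1, 0.6, 0.3, 10.0, 1.0)
```

The ambient and diffuse contributions are identical for both viewers; only the specular term changes with the view direction.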
Figure 2.18 Cross-section through a Phong shading model BRDF for a fixed incident
illumination direction: (a) component values as a function of angle away from surface normal;
(b) polar plot. The value of the Phong exponent $k_e$ is indicated by the “Exp” labels and the
light source is at an angle of 30° away from the normal.


Figure 2.18 shows a typical set of Phong shading model components as a function of the 
angle away from the surface normal (in a plane containing both the lighting direction and the 
viewer). 

Typically, the ambient and diffuse reflection color distributions $k_a(\lambda)$ and $k_d(\lambda)$ are the
same, since they are both due to sub-surface scattering (body reflection) inside the surface
material (Shafer 1985). The specular reflection distribution $k_s(\lambda)$ is often uniform (white),
since it is caused by interface reflections that do not change the light color. (The exception
to this is metallic materials, such as copper, as opposed to the more common dielectric
materials, such as plastics.)

The ambient illumination $L_a(\lambda)$ often has a different color cast from the direct light
sources $L_i(\lambda)$, e.g., it may be blue for a sunny outdoor scene or yellow for an interior lit
with candles or incandescent lights. (The presence of ambient sky illumination in shadowed
areas is what often causes shadows to appear bluer than the corresponding lit portions of a
scene.) Note also that the diffuse component of the Phong model (or of any shading model)
depends on the angle of the incoming light source $\hat{v}_i$, while the specular component depends
on the relative angle between the viewer $\hat{v}_r$ and the specular reflection direction $\hat{s}_i$ (which
itself depends on the incoming light direction $\hat{v}_i$ and the surface normal $\hat{n}$).

The Phong shading model has been superseded in terms of physical accuracy by newer 
models in computer graphics, including the model developed by Cook and Torrance (1982) 
based on the original micro-facet model of Torrance and Sparrow (1967). While, initially, 
computer graphics hardware implemented the Phong model, the advent of programmable 


pixel shaders has made the use of more complex models feasible. 


Di-chromatic reflection model


The Torrance and Sparrow (1967) model of reflection also forms the basis of Shafer’s (1985)
di-chromatic reflection model, which states that the apparent color of a uniform material lit
from a single source depends on the sum of two terms,

$$L_r(\hat{v}_r; \lambda) = L_i(\hat{v}_r, \hat{v}_i, \hat{n}; \lambda) + L_b(\hat{v}_r, \hat{v}_i, \hat{n}; \lambda) \tag{2.95}$$

$$\quad = c_i(\lambda) m_i(\hat{v}_r, \hat{v}_i, \hat{n}) + c_b(\lambda) m_b(\hat{v}_r, \hat{v}_i, \hat{n}), \tag{2.96}$$

i.e., the radiance of the light reflected at the interface, $L_i$, and the radiance reflected at the
surface body, $L_b$. Each of these, in turn, is a simple product between a relative power
spectrum $c(\lambda)$, which depends only on wavelength, and a magnitude $m(\hat{v}_r, \hat{v}_i, \hat{n})$, which depends
only on geometry. (This model can easily be derived from a generalized version of Phong’s
model by assuming a single light source and no ambient illumination, and rearranging terms.)
The di-chromatic model has been successfully used in computer vision to segment specular 
colored objects with large variations in shading (Klinker 1993) and has inspired local two- 
color models for applications such as Bayer pattern demosaicing (Bennett, Uyttendaele et al. 
2006). 


Global illumination (ray tracing and radiosity) 


The simple shading model presented thus far assumes that light rays leave the light sources, 
bounce off surfaces visible to the camera, thereby changing in intensity or color, and arrive 
at the camera. In reality, light sources can be shadowed by occluders and rays can bounce 
multiple times around a scene while making their trip from a light source to the camera. 

Two methods have traditionally been used to model such effects. If the scene is mostly 
specular (the classic example being scenes made of glass objects and mirrored or highly pol- 
ished balls), the preferred approach is ray tracing or path tracing (Glassner 1995; Akenine- 
Möller and Haines 2002; Marschner and Shirley 2015), which follows individual rays from 
the camera across multiple bounces towards the light sources (or vice versa). If the scene 
is composed mostly of uniform albedo simple geometry illuminators and surfaces, radiosity 
(global illumination) techniques are preferred (Cohen and Wallace 1993; Sillion and Puech 
1994; Glassner 1995). Combinations of the two techniques have also been developed (Wal- 
lace, Cohen, and Greenberg 1987), as well as more general light transport techniques for 
simulating effects such as the caustics cast by rippling water. 

The basic ray tracing algorithm associates a light ray with each pixel in the camera im- 
age and finds its intersection with the nearest surface. A primary contribution can then be 


computed using the simple shading equations presented previously (e.g., Equation (2.94)) 
for all light sources that are visible for that surface element. (An alternative technique for
computing which surfaces are illuminated by a light source is to compute a shadow map,
or shadow buffer, i.e., a rendering of the scene from the light source’s perspective, and then
compare the depth of pixels being rendered with the map (Williams 1983; Akenine-Möller
and Haines 2002).) Additional secondary rays can then be cast along the specular direction 
towards other objects in the scene, keeping track of any attenuation or color change that the 
specular reflection induces. 

Radiosity works by associating lightness values with rectangular surface areas in the scene 
(including area light sources). The amount of light interchanged between any two (mutually 
visible) areas in the scene can be captured as a form factor, which depends on their relative 
orientation and surface reflectance properties, as well as the $1/r^2$ fall-off as light is distributed
over a larger effective sphere the further away it is (Cohen and Wallace 1993; Sillion and 
Puech 1994; Glassner 1995). A large linear system can then be set up to solve for the final 
lightness of each area patch, using the light sources as the forcing function (right-hand side). 
Once the system has been solved, the scene can be rendered from any desired point of view. 
Under certain circumstances, it is possible to recover the global illumination in a scene from 
photographs using computer vision techniques (Yu, Debevec et al. 1999). 
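The linear system mentioned above is commonly written as $B_i = E_i + \rho_i \sum_j F_{ij} B_j$, where $B_i$ is the lightness (radiosity) of patch $i$, $E_i$ its emission, $\rho_i$ its reflectance, and $F_{ij}$ the form factor. The text does not spell out this equation, so the sketch below assumes this standard formulation, keeps the reflectance separate from a purely geometric form factor, and uses made-up values for a two-patch scene. It solves the system by simple fixed-point iteration rather than a direct solver.

```python
def solve_radiosity(emission, reflectance, form_factors, iterations=100):
    """Iterate B_i = E_i + rho_i * sum_j F_ij * B_j (Jacobi-style updates)."""
    n = len(emission)
    B = list(emission)                     # start from the emitted light only
    for _ in range(iterations):
        B = [emission[i] + reflectance[i] *
             sum(form_factors[i][j] * B[j] for j in range(n))
             for i in range(n)]
    return B


# Hypothetical scene: patch 0 is an area light, patch 1 is a passive wall.
E = [1.0, 0.0]
rho = [0.5, 0.5]
F = [[0.0, 0.5],
     [0.5, 0.0]]
B = solve_radiosity(E, rho, F)
```

The passive wall ends up with non-zero lightness, and the emitter becomes slightly brighter than its own emission because of light bounced back from the wall.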

The basic radiosity algorithm does not take into account certain near field effects, such 
as the darkening inside corners and scratches, or the limited ambient illumination caused 
by partial shadowing from other surfaces. Such effects have been exploited in a number of 
computer vision algorithms (Nayar, Ikeuchi, and Kanade 1991; Langer and Zucker 1994). 

While all of these global illumination effects can have a strong effect on the appearance 
of a scene, and hence its 3D interpretation, they are not covered in more detail in this book. 
(But see Section 13.7.1 for a discussion of recovering BRDFs from real scenes and objects.) 


2.2.3 Optics 


Once the light from a scene reaches the camera, it must still pass through the lens before 
reaching the analog or digital sensor. For many applications, it suffices to treat the lens as an 
ideal pinhole that simply projects all rays through a common center of projection (Figures 2.8 
and 2.9). 

However, if we want to deal with issues such as focus, exposure, vignetting, and aber- 
ration, we need to develop a more sophisticated model, which is where the study of optics 
comes in (Möller 1988; Ray 2002; Hecht 2015). 

Figure 2.19 shows a diagram of the most basic lens model, i.e., the thin lens composed
of a single piece of glass with very low, equal curvature on both sides. According to the 


lens law (which can be derived using simple geometric arguments on light ray refraction), the 
Figure 2.19 A thin lens of focal length $f$ focuses the light from a plane at a distance $z_o$
in front of the lens onto a plane at a distance $z_i$ behind the lens, where
$\frac{1}{z_o} + \frac{1}{z_i} = \frac{1}{f}$. If
the focal plane (vertical gray line next to $c$) is moved forward, the images are no longer in
focus and the circle of confusion $c$ (small thick line segments) depends on the distance of the
image plane motion $\Delta z_i$ relative to the lens aperture diameter $d$. The field of view (f.o.v.)
depends on the ratio between the sensor width $W$ and the focal length $f$ (or, more precisely,
the focusing distance $z_i$, which is usually quite close to $f$).


relationship between the distance to an object zo and the distance behind the lens at which a 


focused image is formed z; can be expressed as 
—+=3% (2.97) 


where f is called the focal length of the lens. If we let zo — oo, i.e., we adjust the lens (move 
the image plane) so that objects at infinity are in focus, we get z; = f, which is why we can 
think of a lens of focal length f as being equivalent (to a first approximation) to a pinhole at 
a distance f from the focal plane (Figure 2.10), whose field of view is given by (2.60). 

If the focal plane is moved away from its proper in-focus setting of $z_i$ (e.g., by twisting
the focus ring on the lens), objects at $z_o$ are no longer in focus, as shown by the gray plane in
Figure 2.19. The amount of misfocus is measured by the circle of confusion $c$ (shown as short
thick blue line segments on the gray plane).¹⁰ The equation for the circle of confusion can
be derived using similar triangles; it depends on the distance of travel in the focal plane $\Delta z_i$
relative to the original focus distance $z_i$ and the diameter of the aperture $d$ (see Exercise 2.4).
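
One way to carry out the similar-triangles argument is sketched below. It assumes an ideal thin lens whose cone of light has base diameter $d$ at the lens and its apex at the in-focus distance $z_i$, so that $c / |\Delta z_i| = d / z_i$. The function name and numbers are illustrative assumptions, and Exercise 2.4 asks for the full derivation.

```python
def circle_of_confusion(d, z_i, delta_z_i):
    """Diameter of the blur circle for a sensor displaced by delta_z_i.

    d: aperture diameter; z_i: distance at which the image is in focus;
    delta_z_i: displacement of the sensor plane from z_i (either sign).
    Follows from similar triangles: c / |delta_z_i| = d / z_i.
    """
    return d * abs(delta_z_i) / z_i


# Hypothetical values in millimeters: 25 mm aperture, image in focus at
# 100 mm, sensor moved 1 mm towards the lens.
c = circle_of_confusion(d=25.0, z_i=100.0, delta_z_i=-1.0)
print(c)
```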

The allowable depth variation in the scene that limits the circle of confusion to an accept- 
able number is commonly called the depth of field and is a function of both the focus distance 


and the aperture, as shown diagrammatically by many lens markings (Figure 2.20). Since this 


10 If the aperture is not completely circular, e.g., if it is caused by a hexagonal diaphragm, it is sometimes possible
to see this effect in the actual blur function (Levin, Fergus et al. 2007; Joshi, Szeliski, and Kriegman 2008) or in the 


“glints” that are seen when shooting into the Sun. 


76 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


[Figure 2.20 labels: Focus Ring, Focus Distance, Depth of Field Indicator, Set Aperture Ring; panels (a) and (b)]

Figure 2.20 Regular and zoom lens depth of field indicators.


depth of field depends on the aperture diameter d, we also have to know how this varies with 
the commonly displayed f-number, which is usually denoted as f/# or N and is defined as 


$$ f/\# = N = \frac{f}{d}, \tag{2.98} $$


where the focal length f and the aperture diameter d are measured in the same unit (say, 
millimeters). 

The usual way to write the f-number is to replace the # in f/# with the actual number,
i.e., f/1.4, f/2, f/2.8, ..., f/22. (Alternatively, we can say N = 1.4, etc.) An easy way to
interpret these numbers is to notice that dividing the focal length by the f-number gives us the
diameter d, so these are just formulas for the aperture diameter.¹¹
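
The conversion in (2.98) can be written out directly. This is a small sketch in which the function names and the 50 mm example lens are assumptions.

```python
def aperture_diameter(focal_length_mm, f_number):
    """Aperture diameter d = f / N, in the same unit as the focal length."""
    return focal_length_mm / f_number


def f_number(focal_length_mm, diameter_mm):
    """f-number N = f / d from (2.98)."""
    return focal_length_mm / diameter_mm


# Hypothetical 50 mm lens at a few common settings.
for n in (1.4, 2.0, 2.8, 4.0):
    print(f"f/{n}: d = {aperture_diameter(50.0, n):.1f} mm")

# The same 25 mm opening on a 100 mm lens corresponds to a larger f-number.
print(f_number(100.0, 25.0))
```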

Notice that the usual progression for f-numbers is in full stops, i.e., successive multiples of $\sqrt{2}$,
since this corresponds to doubling the area of the entrance pupil each time a smaller f-number 
is selected. (This doubling is also called changing the exposure by one exposure value or EV. 
It has the same effect on the amount of light reaching the sensor as doubling the exposure 
duration, e.g., from 1/250 to 1/125; see Exercise 2.5.) 
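
The following sketch generates a sequence of full stops and checks the two equivalences described above. It assumes that the light reaching the sensor is proportional to $t / N^2$ (exposure time over squared f-number); the function names are illustrative.

```python
import math


def full_stops(count, start=1.0):
    """f-numbers one full stop apart (each is sqrt(2) times the previous one)."""
    return [start * math.sqrt(2.0) ** k for k in range(count)]


def relative_exposure(f_number, shutter_time_s):
    """Light reaching the sensor, up to a constant: proportional to t / N^2."""
    return shutter_time_s / f_number ** 2


stops = full_stops(10)
print([round(n, 1) for n in stops])

# Opening up by one stop doubles the light, as does doubling the exposure time.
one_stop_gain = relative_exposure(stops[2], 1 / 250) / relative_exposure(stops[3], 1 / 250)
time_gain = relative_exposure(stops[3], 1 / 125) / relative_exposure(stops[3], 1 / 250)
print(one_stop_gain, time_gain)
```

The values printed for the stops are exact powers of $\sqrt{2}$ rounded to one decimal; the markings engraved on lenses are conventional roundings of the same sequence.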

Now that you know how to convert between f-numbers and aperture diameters, you can 
construct your own plots for the depth of field as a function of focal length f, circle of 
confusion c, and focus distance $z_o$, as explained in Exercise 2.4, and see how well these
match what you observe on actual lenses, such as those shown in Figure 2.20. 
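
As a starting point for such plots, here is a minimal sketch of the near and far limits of acceptable focus. It assumes the ideal thin lens law (2.97), the f-number definition (2.98), and the similar-triangles blur model $c = d\,|\Delta z_i| / z_i$; the function names and the example values (50 mm lens, f/8, 0.03 mm acceptable blur) are assumptions, not values taken from the text.

```python
import math


def blur_diameter(f, n, z_focus, z):
    """Blur circle on the sensor for an object at depth z when focused at z_focus."""
    d = f / n                                   # aperture diameter, from (2.98)
    s = 1.0 / (1.0 / f - 1.0 / z_focus)         # sensor distance (in-focus plane)
    v = 1.0 / (1.0 / f - 1.0 / z)               # where the object at z is imaged
    return d * abs(s - v) / v


def depth_of_field(f, n, z_focus, c):
    """Near and far object distances whose blur circle equals c."""
    k = c / (f / n)                             # allowed blur relative to aperture
    a = 1.0 - f / z_focus                       # equals f / z_i for the focus plane
    near = f / (1.0 - (1.0 - k) * a)
    far_denominator = 1.0 - (1.0 + k) * a
    far = f / far_denominator if far_denominator > 0 else math.inf
    return near, far


# Hypothetical example, all lengths in millimeters.
near, far = depth_of_field(f=50.0, n=8.0, z_focus=3000.0, c=0.03)
print(near, far)
```

Sweeping `z_focus` or `n` and plotting `near` and `far` gives curves that can be compared with the markings on a lens barrel. When the far limit reaches infinity, the lens is focused at or beyond its hyperfocal distance.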

Of course, real lenses are not infinitely thin and therefore suffer from geometric aber- 
rations, unless compound elements are used to correct for them. The classic five Seidel 
aberrations, which arise when using third-order optics, include spherical aberration, coma, 
astigmatism, curvature of field, and distortion (Möller 1988; Ray 2002; Hecht 2015). 


11 This also explains why, with zoom lenses, the f-number varies with the current zoom (focal length) setting.

Figure 2.21 In a lens subject to chromatic aberration, light at different wavelengths (e.g.,
the red and blue arrows) is focused with a different focal length $f'$ and hence a different depth
$z_i'$, resulting in both a geometric (in-plane) displacement and a loss of focus.


Chromatic aberration 


Because the index of refraction for glass varies slightly as a function of wavelength, simple 
lenses suffer from chromatic aberration, which is the tendency for light of different colors
to focus at slightly different distances (and hence also with slightly different magnification
factors), as shown in Figure 2.21. The wavelength-dependent magnification factor, i.e.,
the transverse chromatic aberration, can be modeled as a per-color radial distortion (Sec- 
tion 2.1.5) and, hence, calibrated using the techniques described in Section 11.1.4. The 
wavelength-dependent blur caused by longitudinal chromatic aberration can be calibrated 
using techniques described in Section 10.1.4. Unfortunately, the blur induced by longitudinal 
aberration can be harder to undo, as higher frequencies can get strongly attenuated and hence
become hard to recover.

To reduce chromatic and other kinds of aberrations, most photographic lenses today are 
compound lenses made of different glass elements (with different coatings). Such lenses can 
no longer be modeled as having a single nodal point P through which all of the rays must 
pass (when approximating the lens with a pinhole model). Instead, these lenses have both a 
front nodal point, through which the rays enter the lens, and a rear nodal point, through which 
they leave on their way to the sensor. In practice, only the location of the front nodal point 
is of interest when performing careful camera calibration, e.g., when determining the point 
around which to rotate to capture a parallax-free panorama (see Section 8.2.3 and Littlefield 
(2006) and Houghton (2013)). 

Not all lenses, however, can be modeled as having a single nodal point. In particular, very 
wide-angle lenses such as fisheye lenses (Section 2.1.5) and certain catadioptric imaging 
systems consisting of lenses and curved mirrors (Baker and Nayar 1999) do not have a single 


point through which all of the acquired light rays pass. In such cases, it is preferable to
explicitly construct a mapping function (look-up table) between pixel coordinates and 3D
rays in space (Gremban, Thorpe, and Kanade 1988; Champleboux, Lavallée et al. 1992a;
Grossberg and Nayar 2001; Sturm and Ramalingam 2004; Tardif, Sturm et al. 2009), as
mentioned in Section 2.1.5.

Figure 2.22 The amount of light hitting a pixel of surface area $\delta i$ depends on the square
of the ratio of the aperture diameter $d$ to the focal length $f$, as well as the fourth power of the
cosine of the off-axis angle $\alpha$, $\cos^4 \alpha$.


Vignetting 


Another property of real-world lenses is vignetting, which is the tendency for the brightness 
of the image to fall off towards the edge of the image. 

Two kinds of phenomena usually contribute to this effect (Ray 2002). The first is called 
natural vignetting and is due to the foreshortening in the object surface, projected pixel, and 
lens aperture, as shown in Figure 2.22. Consider the light leaving the object surface patch 
of size $\delta o$ located at an off-axis angle $\alpha$. Because this patch is foreshortened with respect
to the camera lens, the amount of light reaching the lens is reduced by a factor $\cos \alpha$. The
amount of light reaching the lens is also subject to the usual $1/r^2$ fall-off; in this case, the
distance $r_o = z_o / \cos \alpha$. The actual area of the aperture through which the light passes
is foreshortened by an additional factor $\cos \alpha$, i.e., the aperture as seen from point $O$ is an
ellipse of dimensions $d \times d \cos \alpha$. Putting all of these factors together, we see that the amount
of light leaving O and passing through the aperture on its way to the image pixel located at I 
is proportional to 


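$$ \delta o \, \frac{\cos \alpha}{r_o^2} \, \pi \left( \frac{d}{2} \right)^2 \cos \alpha = \delta o \, \frac{\pi}{4} \frac{d^2}{z_o^2} \cos^4 \alpha. \tag{2.99} $$

Since triangles $\Delta OPQ$ and $\Delta IPJ$ are similar, the projected areas of the object surface $\delta o$
and image pixel $\delta i$ are in the same (squared) ratio as $z_o : z_i$,

$$ \frac{\delta o}{\delta i} = \frac{z_o^2}{z_i^2}. \tag{2.100} $$

Putting these together, we obtain the final relationship between the amount of light reaching
pixel $i$ and the aperture diameter $d$, the focusing distance $z_i \approx f$, and the off-axis angle $\alpha$,

$$ \delta o \, \frac{\pi}{4} \frac{d^2}{z_o^2} \cos^4 \alpha = \delta i \, \frac{\pi}{4} \frac{d^2}{z_i^2} \cos^4 \alpha \approx \delta i \, \frac{\pi}{4} \left( \frac{d}{f} \right)^2 \cos^4 \alpha, \tag{2.101} $$

which is called the fundamental radiometric relation between the scene radiance $L$ and the
light (irradiance) $E$ reaching the pixel sensor,

$$ E = L \, \frac{\pi}{4} \left( \frac{d}{f} \right)^2 \cos^4 \alpha, \tag{2.102} $$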


(Horn 1986; Nalwa 1993; Ray 2002; Hecht 2015). Notice in this equation how the amount of 
light depends on the pixel surface area (which is why the smaller sensors in point-and-shoot 
cameras are so much noisier than digital single lens reflex (SLR) cameras), the inverse square 
of the f-stop $N = f/d$ (2.98), and the fourth power of the off-axis cosine, $\cos^4 \alpha$, which is
the natural vignetting term.
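
To get a feel for the size of these terms, the fundamental radiometric relation (2.102) can be evaluated numerically. The sketch below uses $d/f = 1/N$ and, for the fall-off across the image, assumes $z_i \approx f$ so that $\tan \alpha = x / f$ for a pixel at distance $x$ from the image center; the function names are illustrative.

```python
import math


def irradiance(radiance, f_number, alpha_rad):
    """E = L * (pi/4) * (d/f)^2 * cos^4(alpha), with d/f = 1/N, as in (2.102)."""
    return radiance * (math.pi / 4.0) / f_number ** 2 * math.cos(alpha_rad) ** 4


def natural_vignetting(x_mm, focal_length_mm):
    """Relative brightness cos^4(alpha) at distance x from the image center.

    Assumes the focusing distance z_i is approximately the focal length f.
    """
    alpha = math.atan2(x_mm, focal_length_mm)
    return math.cos(alpha) ** 4


# On axis at f/2, a unit radiance yields pi/16 units of irradiance.
print(irradiance(1.0, 2.0, 0.0))
# Hypothetical example: 21.6 mm off-center (roughly the corner of a
# 36 mm x 24 mm frame) behind a 50 mm lens.
print(natural_vignetting(21.6, 50.0))
```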

The other major kind of vignetting, called mechanical vignetting, is caused by the internal 
occlusion of rays near the periphery of lens elements in a compound lens, and cannot easily 
be described mathematically without performing a full ray-tracing of the actual lens design.¹²
However, unlike natural vignetting, mechanical vignetting can be decreased by reducing the 
camera aperture (increasing the f-number). It can also be calibrated (along with natural vi- 
gnetting) using special devices such as integrating spheres, uniformly illuminated targets, or 
camera rotation, as discussed in Section 10.1.3. 


2.3 The digital camera 


After starting from one or more light sources, reflecting off one or more surfaces in the world, 
and passing through the camera’s optics (lenses), light finally reaches the imaging sensor. 
How are the photons arriving at this sensor converted into the digital (R, G, B) values that 
we observe when we look at a digital image? In this section, we develop a simple model that 
accounts for the most important effects, such as exposure (gain and shutter speed), non-linear 
mappings, sampling and aliasing, and noise. Figure 2.23, which is based on camera models 
developed by Healey and Kondepudy (1994), Tsin, Ramesh, and Kanade (2001), and Liu, 
Szeliski et al. (2008), shows a simple version of the processing stages that occur in mod- 
ern digital cameras. Chakrabarti, Scharstein, and Zickler (2009) developed a sophisticated 
24-parameter model that is an even better match to the processing performed in digital cam- 
eras, while Kim, Lin et al. (2012), Hasinoff, Sharlet et al. (2016), and Karaimer and Brown 


12 There are some empirical models that work well in practice (Kang and Weiss 2000; Zheng, Lin, and Kang 2006).

Figure 2.23 Image sensing pipeline, showing the various sources of noise as well as typical
digital post-processing steps.


(2016) provide more recent models of modern in-camera processing pipelines. Most recently, 
Brooks, Mildenhall et al. (2019) have developed detailed models of in-camera image process- 
ing pipelines to invert (unprocess) noisy JPEG images into their RAW originals, so that they 
can be better denoised, while Tseng, Yu et al. (2019) develop a tunable model of camera
processing pipelines that can be used for image quality optimization. 

Light falling on an imaging sensor is usually picked up by an active sensing area, integrated
for the duration of the exposure (usually expressed as the shutter speed in a fraction of
a second, e.g., 1/125, 1/60, 1/30), and then passed to a set of sense amplifiers. The two main kinds
of sensor used in digital still and video cameras today are charge-coupled device (CCD) and
complementary metal-oxide-semiconductor (CMOS).

In a CCD, photons are accumulated in each active well during the exposure time. Then, 
in a transfer phase, the charges are transferred from well to well in a kind of “bucket brigade” 
until they are deposited at the sense amplifiers, which amplify the signal and pass it to 
an analog-to-digital converter (ADC).¹³ Older CCD sensors were prone to blooming, when
charges from one over-exposed pixel spilled into adjacent ones, but most newer CCDs have 
anti-blooming technology (“troughs” into which the excess charge can spill). 


In CMOS, the photons hitting the sensor directly affect the conductivity (or gain) of a 


13 In digital still cameras, a complete frame is captured and then read out sequentially at once. However, if video
is being captured, a rolling shutter, which exposes and transfers each line separately, is often used. In older video
cameras, the even fields (lines) were scanned first, followed by the odd fields, in a process that is called interlacing.

[Figure 2.24 panel labels: (a) CCD, photon to electron conversion, conversion to voltage; (b) Anatomy of the Active Pixel Sensor Photodiode]

Figure 2.24 Digital imaging sensors: (a) CCDs move photogenerated charge from pixel
to pixel and convert it to voltage at the output node; CMOS imagers convert charge to
voltage inside each pixel (Litwiller 2005) © 2005 Photonics Spectra; (b) cutaway diagram
of a CMOS pixel sensor, from
https://micro.magnet.fsu.edu/primer/digitalimaging/cmosimagesensors.html.


photodetector, which can be selectively gated to control exposure duration, and locally am- 
plified before being read out using a multiplexing scheme. Traditionally, CCD sensors out- 
performed CMOS in quality-sensitive applications, such as digital SLRs, while CMOS was 
better for low-power applications, but today CMOS is used in most digital cameras. 

The main factors affecting the performance of a digital image sensor are the shutter speed, 
sampling pitch, fill factor, chip size, analog gain, sensor noise, and the resolution (and quality) 
of the analog-to-digital converter. Many of the actual values for these parameters can be read 
from the EXIF tags embedded with digital images, while others can be obtained from the 


camera manufacturers’ specification sheets or from camera review or calibration websites.¹⁴


Shutter speed. The shutter speed (exposure time) directly controls the amount of light 
reaching the sensor and hence determines if images are under- or over-exposed. (For bright 
scenes, where a large aperture or slow shutter speed is desired to get a shallow depth of field 
or motion blur, neutral density filters are sometimes used by photographers.) For dynamic 
scenes, the shutter speed also determines the amount of motion blur in the resulting picture. 
Usually, a higher shutter speed (less motion blur) makes subsequent analysis easier (see Sec- 
tion 10.3 for techniques to remove such blur). However, when video is being captured for 


display, some motion blur may be desirable to avoid stroboscopic effects. 


14 http://www.clarkvision.com/imagedetail/digital.sensor.performance.summary

Sampling pitch. The sampling pitch is the physical spacing between adjacent sensor cells
on the imaging chip (Figure 2.24). A sensor with a smaller sampling pitch has a higher 
sampling density and hence provides a higher resolution (in terms of pixels) for a given active 
chip area. However, a smaller pitch also means that each sensor has a smaller area and cannot 


accumulate as many photons; this makes it not as light sensitive and more prone to noise. 


Fill factor. The fill factor is the active sensing area size as a fraction of the theoretically 
available sensing area (the product of the horizontal and vertical sampling pitches). Higher fill 
factors are usually preferable, as they result in more light capture and less aliasing (see Sec- 
tion 2.3.1). While the fill factor was originally limited by the need to place additional electron- 
ics between the active sensing areas, modern backside illumination (or back-illuminated) sen- 
sors, coupled with efficient microlens designs, have largely removed this limitation (Fontaine 
2015).¹⁵ The fill factor of a camera can be determined empirically using a photometric camera


calibration process (see Section 10.1.4). 
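
The relationship between chip size, sampling pitch, and fill factor can be made concrete with a few lines of arithmetic. In this sketch, the function names and the pixel counts are assumptions made for illustration; only the sensor widths (36 mm full frame and 5.76 mm compact) come from the dimensions quoted in this section.

```python
def sampling_pitch_um(sensor_width_mm, pixels_across):
    """Spacing between adjacent sensor cells, in micrometers."""
    return 1000.0 * sensor_width_mm / pixels_across


def active_area_um2(pitch_x_um, pitch_y_um, fill_factor):
    """Light-gathering area per cell: fill factor times the available cell area."""
    return fill_factor * pitch_x_um * pitch_y_um


# Hypothetical pixel counts: 6000 across a 36 mm wide full-frame sensor,
# 3072 across a 5.76 mm wide compact-camera sensor.
pitch_full = sampling_pitch_um(36.0, 6000)
pitch_compact = sampling_pitch_um(5.76, 3072)
print(pitch_full, pitch_compact)

# With equal fill factors, the photon-collecting area scales with pitch squared.
area_ratio = active_area_um2(pitch_full, pitch_full, 0.9) / active_area_um2(
    pitch_compact, pitch_compact, 0.9
)
print(area_ratio)
```

Under these assumed numbers, each cell on the larger chip collects roughly ten times as much light, which is the reason smaller pitches tend to be noisier.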


Chip size. Video and point-and-shoot cameras have traditionally used small chip areas
(1/4-inch to 1/2-inch sensors¹⁶), while digital SLR cameras try to come closer to the traditional
size of a 35mm film frame.¹⁷ When overall device size is not important, having a larger chip
size is preferable, since each sensor cell can be more photo-sensitive. (For compact cameras,
a smaller chip means that all of the optics can be shrunk down proportionately.) However,
larger chips are more expensive to produce, not only because fewer chips can be packed into
each wafer, but also because the probability of a chip defect goes up exponentially with the
chip area.


Analog gain. Before analog-to-digital conversion, the sensed signal is usually boosted by 
a sense amplifier. In video cameras, the gain on these amplifiers was traditionally controlled 
by automatic gain control (AGC) logic, which would adjust these values to obtain a good 
overall exposure. In newer digital still cameras, the user now has some additional control 
over this gain through the ISO setting, which is typically expressed in ISO standard units


such as 100, 200, or 400. Since the automated exposure control in most cameras also adjusts 


15 https://en.wikipedia.org/wiki/Back-illuminated_sensor

16 These numbers refer to the “tube diameter” of the old vidicon tubes used in video cameras. The 1/2.5” sensor
on the Canon SD800 camera actually measures 5.76mm × 4.29mm, i.e., a sixth of the size (on a side) of a 35mm
full-frame (36mm × 24mm) DSLR sensor.

17 When a DSLR chip does not fill the 35mm full frame, it results in a multiplier effect on the lens focal length.
For example, a chip that is only 0.625 the dimension of a 35mm frame will make a 50mm lens image the same angular
extent as a 50/0.625 = 50 × 1.6 = 80mm lens, as demonstrated in (2.60).

the aperture and shutter speed, setting the ISO manually removes one degree of freedom from
the camera’s control, just as manually specifying aperture and shutter speed does. In theory, a 
higher gain allows the camera to perform better under low light conditions (less motion blur 
due to long exposure times when the aperture is already maxed out). In practice, however, 


higher ISO settings usually amplify the sensor noise. 


Sensor noise. Throughout the whole sensing process, noise is added from various sources, 
which may include fixed pattern noise, dark current noise, shot noise, amplifier noise, and 
quantization noise (Healey and Kondepudy 1994; Tsin, Ramesh, and Kanade 2001). The 
final amount of noise present in a sampled image depends on all of these quantities, as well 
as the incoming light (controlled by the scene radiance and aperture), the exposure time, and 
the sensor gain. Also, for low light conditions where the noise is due to low photon counts, a 
Poisson model of noise may be more appropriate than a Gaussian model (Alter, Matsushita, 
and Tang 2006; Matsushita and Lin 2007a; Wilburn, Xu, and Matsushita 2008; Takamatsu, 
Matsushita, and Ikeuchi 2008). 
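
To see why low photon counts are the difficult case, consider a deliberately simplified noise budget. The sketch assumes independent noise sources, Poisson-distributed shot and dark-current noise (so their variances equal their mean electron counts), and a signal-independent read noise; fixed pattern and quantization noise are ignored. This is not the full model of the cited papers, and all names and numbers are assumptions.

```python
import math


def noise_std_electrons(signal_e, dark_e=0.0, read_noise_e=0.0):
    """Standard deviation of the sensed value, in electrons.

    Poisson shot and dark-current noise contribute variances equal to their
    means; read noise is given directly as a standard deviation.
    """
    return math.sqrt(signal_e + dark_e + read_noise_e ** 2)


def snr(signal_e, dark_e=0.0, read_noise_e=0.0):
    """Signal-to-noise ratio for a given mean number of signal electrons."""
    return signal_e / noise_std_electrons(signal_e, dark_e, read_noise_e)


# With shot noise only, the SNR grows as the square root of the signal.
print(snr(100.0), snr(10000.0))
# Read noise matters most when few photons are collected.
print(snr(100.0, read_noise_e=10.0), snr(10000.0, read_noise_e=10.0))
```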

As discussed in more detail in Section 10.1.1, Liu, Szeliski et al. (2008) use this model, 
along with an empirical database of camera response functions (CRFs) obtained by Grossberg 
and Nayar (2004), to estimate the noise level function (NLF) for a given image, which predicts 
the overall noise variance at a given pixel as a function of its brightness (a separate NLF is 
estimated for each color channel). An alternative approach, when you have access to the 
camera before taking pictures, is to pre-calibrate the NLF by taking repeated shots of a scene 
containing a variety of colors and luminances, such as the Macbeth Color Chart shown in 
Figure 10.3b (McCamy, Marcus, and Davidson 1976). (When estimating the variance, be sure 
to throw away or downweight pixels with large gradients, as small shifts between exposures 
will affect the sensed values at such pixels.) Unfortunately, the pre-calibration process may 
have to be repeated for different exposure times and gain settings because of the complex 
interactions occurring within the sensing system. 
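
The pre-calibration idea can be sketched as follows: for each pixel, compute the mean and variance over the repeated shots, then average the variances within brightness bins. This is only an illustrative sketch, not the estimator of Liu, Szeliski et al. (2008); it omits the downweighting of high-gradient pixels mentioned above, and the simulated scene and its linear noise law are made up to exercise the code.

```python
import random
import statistics


def noise_level_function(shots, n_bins=4, max_value=255.0):
    """Estimate (mean brightness, noise variance) pairs from repeated shots.

    shots: list of exposures of a static scene, each a flat list of pixel
    values for one color channel, all of the same length.
    """
    bins = [[] for _ in range(n_bins)]
    for values in zip(*shots):                  # one pixel across all shots
        m = statistics.mean(values)
        v = statistics.variance(values)         # unbiased sample variance
        b = min(max(int(m / max_value * n_bins), 0), n_bins - 1)
        bins[b].append((m, v))
    return [
        (statistics.mean(m for m, _ in b), statistics.mean(v for _, v in b))
        for b in bins
        if b
    ]


def simulated_variance(brightness):
    """Made-up noise law used only to generate test data."""
    return 1.0 + 0.05 * brightness


rng = random.Random(0)
levels = [32.0, 96.0, 160.0, 224.0]
scene = [level for level in levels for _ in range(200)]
shots = [
    [rng.gauss(p, simulated_variance(p) ** 0.5) for p in scene]
    for _ in range(30)
]
nlf = noise_level_function(shots)
for brightness, variance in nlf:
    print(round(brightness, 1), round(variance, 2))
```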

In practice, most computer vision algorithms, such as image denoising, edge detection,
and stereo matching, benefit from at least a rudimentary estimate of the noise level. Barring
the ability to pre-calibrate the camera or to take repeated shots of the same scene, the simplest
approach is to look for regions of near-constant value and to estimate the noise variance in
such regions (Liu, Szeliski et al. 2008).


ADC resolution. The final step in the analog processing chain occurring within an imaging 
sensor is the analog to digital conversion (ADC). While a variety of techniques can be used 


to implement this process, the two quantities of interest are the resolution of this process
(how many bits it yields) and its noise level (how many of these bits are useful in practice).
For most cameras, the number of bits quoted (eight bits for compressed JPEG images and a 
nominal 16 bits for the RAW formats provided by some DSLRs) exceeds the actual number 
of usable bits. The best way to tell is to simply calibrate the noise of a given sensor, e.g., 
by taking repeated shots of the same scene and plotting the estimated noise as a function of 


brightness (Exercise 2.6). 
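
Once the noise has been measured, a rough rule of thumb relates it to the number of usable bits. The sketch below treats the noise standard deviation (in digital numbers) as the smallest distinguishable step; this rule and the example numbers are assumptions for illustration and are not taken from the text.

```python
import math


def usable_bits(full_scale, noise_std):
    """Rough number of bits that rise above the noise floor.

    full_scale: number of digital levels (e.g., 2**14 for a 14-bit converter);
    noise_std: measured noise standard deviation in digital numbers.
    """
    return math.log2(full_scale / noise_std)


# Hypothetical example: a nominal 14-bit converter with 4 levels of noise.
print(usable_bits(2 ** 14, 4.0))
```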


Digital post-processing. Once the irradiance values arriving at the sensor have been con- 
verted to digital bits, most cameras perform a variety of digital signal processing (DSP) 
operations to enhance the image before compressing and storing the pixel values. These in- 
clude color filter array (CFA) demosaicing, white point setting, and mapping of the luminance 
values through a gamma function to increase the perceived dynamic range of the signal. We 
cover these topics in Section 2.3.2 but, before we do, we return to the topic of aliasing, which 


was mentioned in connection with sensor array fill factors. 


Newer imaging sensors. The capabilities of imaging sensors and related technologies such
as depth sensors continue to evolve rapidly. Venues that track these developments include
the IS&T Symposium on Electronic Imaging Science and Technology, sponsored by the
Society for Imaging Science and Technology, and the Image Sensors World blog.


2.3.1 Sampling and aliasing 


What happens when a field of light impinging on the image sensor falls onto the active sense 
areas in the imaging chip? The photons arriving at each active cell are integrated and then 
digitized, as shown in Figure 2.24. However, if the fill factor on the chip is small and the 
signal is not otherwise band-limited, visually unpleasing aliasing can occur. 

To explore the phenomenon of aliasing, let us first look at a one-dimensional signal (Fig- 
ure 2.25), in which we have two sine waves, one at a frequency of f = 3/4 and the other at 
f = 5/4. If we sample these two signals at a frequency of f = 2, we see that they produce 
the same samples (shown in black), and so we say that they are aliased.¹⁸ Why is this a bad
effect? In essence, we can no longer reconstruct the original signal, since we do not know 
which of the two original frequencies was present. 
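
The aliasing of these two frequencies is easy to verify numerically. The sketch below assumes cosine-phase signals, for which the two sets of samples are identical (with sine phase, they differ only in sign); the function names are illustrative.

```python
import math


def sample_cosine(freq, sample_rate, count):
    """Samples of cos(2*pi*freq*t) taken at times t = k / sample_rate."""
    return [math.cos(2.0 * math.pi * freq * k / sample_rate) for k in range(count)]


def alias_frequency(freq, sample_rate):
    """Frequency in [0, sample_rate/2] that yields the same cosine samples."""
    return abs(freq - sample_rate * round(freq / sample_rate))


low = sample_cosine(0.75, 2.0, 8)
high = sample_cosine(1.25, 2.0, 8)
identical = all(abs(a - b) < 1e-9 for a, b in zip(low, high))
print(identical, alias_frequency(1.25, 2.0))
```

Given only the samples, there is no way to tell which of the two inputs produced them, which is exactly the ambiguity described above.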

In fact, Shannon’s Sampling Theorem (Oppenheim and Schafer 1996; Oppenheim, Schafer, and Buck 1999)
shows that the minimum sampling rate required to reconstruct a signal


18 An alias is an alternate name for someone, so the sampled signal corresponds to two different aliases.

Figure 2.25 Aliasing of a one-dimensional signal: The blue sine wave at f = 3/4 and the
red sine wave at f = 5/4 have the same digital samples, when sampled at f = 2. Even after 
convolution with a 100% fill factor box filter, the two signals, while no longer of the same 
magnitude, are still aliased in the sense that the sampled red signal looks like an inverted 
lower magnitude version of the blue signal. (The image on the right is scaled up for better 


visibility. The actual sine magnitudes are 30% and −18% of their original values.)


from its instantaneous samples must be at least twice the highest frequency,¹⁹

$$ f_s \geq 2 f_{\max}. \tag{2.103} $$


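When a signal is sampled at exactly this minimum rate, its maximum frequency $f_{\max} = f_s/2$ is known
as the Nyquist frequency, while the minimum sampling frequency $f_s = 2 f_{\max}$ is known as the Nyquist
rate; its inverse, $r_s = 1/f_s$, is the corresponding sample spacing.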

However, you may ask, as an imaging chip actually averages the light field over a finite 
area, are the results on point sampling still applicable? Averaging over the sensor area does 
tend to attenuate some of the higher frequencies. However, even if the fill factor is 100%, 
as in the right image of Figure 2.25, frequencies above the Nyquist limit (half the sampling 
frequency) still produce an aliased signal, although with a smaller magnitude than the corre- 
sponding band-limited signals. 

A more convincing argument as to why aliasing is bad can be seen by downsampling 
a signal using a poor quality filter such as a box (square) filter. Figure 2.26 shows a high- 
frequency chirp image (so called because the frequencies increase over time), along with the 
results of sampling it with a 25% fill-factor area sensor, a 100% fill-factor sensor, and a high- 
quality 9-tap filter. Additional examples of downsampling (decimation) filters can be found 
in Section 3.5.2 and Figure 3.29. 

The best way to predict the amount of aliasing that an imaging system (or even an image 
processing algorithm) will produce is to estimate the point spread function (PSF), which 
represents the response of a particular pixel sensor to an ideal point light source. The PSF 


is a combination (convolution) of the blur induced by the optical system (lens) and the finite 


19 The actual theorem states that $f_s$ must be at least twice the signal bandwidth but, as we are not dealing with
modulated signals such as radio waves during image capture, the maximum frequency suffices.

Figure 2.26 Aliasing of a two-dimensional signal: (a) original full-resolution image; (b)
downsampled 4× with a 25% fill factor box filter; (c) downsampled 4× with a 100% fill
factor box filter; (d) downsampled 4× with a high-quality 9-tap filter. Notice how the higher
frequencies are aliased into visible frequencies with the lower quality filters, while the 9-tap
filter completely removes these higher frequencies.

integration area of a chip sensor.²⁰


If we know the blur function of the lens and the fill factor (sensor area shape and spacing) 
for the imaging chip (plus, optionally, the response of the anti-aliasing filter), we can convolve 
these (as described in Section 3.2) to obtain the PSF. Figure 2.27a shows the one-dimensional 
cross-section of a PSF for a lens whose blur function is assumed to be a disc with a radius 
equal to the pixel spacing s plus a sensing chip whose horizontal fill factor is 80%. Taking 
the Fourier transform of this PSF (Section 3.4), we obtain the modulation transfer function 
(MTF), from which we can estimate the amount of aliasing as the area of the Fourier magnitude
outside the $f \leq f_s$ Nyquist frequency.²¹ If we defocus the lens so that the blur function
has a radius of 2s (Figure 2.27c), we see that the amount of aliasing decreases significantly, 
but so does the amount of image detail (frequencies closer to f = fs). 
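
The construction described in this paragraph can be imitated in one dimension. The sketch below makes several simplifying assumptions that are not from the text: the cross-section of the lens blur is modeled as a box of width $2r$ (a slice through a uniform disc), all functions are discretized on a grid of 16 samples per pixel, the MTF is computed with a direct discrete Fourier transform, and the aliasing measure is the area of the MTF above 0.5 cycles per pixel (half the sampling frequency).

```python
import cmath
import math

SAMPLES_PER_PIXEL = 16    # fine grid approximating continuous functions
GRID_SIZE = 256           # total samples, i.e., 16 pixels of support


def box(width_pixels):
    """Unit-area box kernel of the given width (in pixels) on the fine grid."""
    n_on = max(1, round(width_pixels * SAMPLES_PER_PIXEL))
    return [1.0 / n_on] * n_on


def convolve(a, b):
    """Full discrete convolution of two kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def mtf(psf):
    """Magnitude of the DFT of the PSF for non-negative frequencies."""
    if len(psf) > GRID_SIZE:
        raise ValueError("PSF does not fit on the grid")
    padded = psf + [0.0] * (GRID_SIZE - len(psf))
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * n / GRID_SIZE)
                for n, v in enumerate(padded)))
        for k in range(GRID_SIZE // 2 + 1)
    ]


def aliasing_area(blur_radius_pixels, fill_factor=0.8):
    """Area of the MTF above 0.5 cycles per pixel."""
    psf = convolve(box(2.0 * blur_radius_pixels), box(fill_factor))
    df = SAMPLES_PER_PIXEL / GRID_SIZE          # bin width in cycles per pixel
    nyquist_bin = round(0.5 / df)
    return sum(mtf(psf)[nyquist_bin + 1:]) * df


area_sharp = aliasing_area(1.0)
area_defocused = aliasing_area(2.0)
print(area_sharp, area_defocused)
```

Under these assumptions, doubling the blur radius lowers the aliasing measure, mirroring the trade-off between aliasing and image detail noted above.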

Under laboratory conditions, the PSF can be estimated (to pixel precision) by looking at 
a point light source such as a pinhole in a black piece of cardboard lit from behind. However, 
this PSF (the actual image of the pinhole) is only accurate to a pixel resolution and, while 
it can model larger blur (such as blur caused by defocus), it cannot model the sub-pixel 
shape of the PSF and predict the amount of aliasing. An alternative technique, described in 
Section 10.1.4, is to look at a calibration pattern (e.g., one consisting of slanted step edges 
(Reichenbach, Park, and Narayanswamy 1991; Williams and Burns 2001; Joshi, Szeliski, and 


20 Imaging chips usually interpose an optical anti-aliasing filter just before the imaging chip to reduce or control
the amount of aliasing. 

21 The complex Fourier transform of the PSF is actually called the optical transfer function (OTF) (Williams 1999). 
Its magnitude is called the modulation transfer function (MTF) and its phase is called the phase transfer function 
(PTF).

Figure 2.27 Sample point spread functions (PSF): The diameter of the blur disc (blue)
in (a) is equal to half the pixel spacing, while the diameter in (c) is twice the pixel spacing. 
The horizontal fill factor of the sensing chip is 80% and is shown in brown. The convolution 
of these two kernels gives the point spread function, shown in green. The Fourier response 
of the PSF (the MTF) is plotted in (b) and (d). The area above the Nyquist frequency where 


aliasing occurs is shown in red. 


Kriegman 2008)) whose ideal appearance can be re-synthesized to sub-pixel precision. 

In addition to occurring during image acquisition, aliasing can also be introduced in various
image processing operations, such as resampling, upsampling, and downsampling.
Sections 3.4 and 3.5.2 discuss these issues and show how careful selection of filters can reduce
the amount of aliasing. 


2.3.2 Color 


In Section 2.2, we saw how lighting and surface reflections are functions of wavelength. 
When the incoming light hits the imaging sensor, light from different parts of the spectrum is 
somehow integrated into the discrete red, green, and blue (RGB) color values that we see in 
a digital image. How does this process work and how can we analyze and manipulate color 
values? 


You probably recall from your childhood days the magical process of mixing paint colors


Figure 2.28 Primary and secondary colors: (a) additive colors red, green, and blue can 
be mixed to produce cyan, magenta, yellow, and white; (b) subtractive colors cyan, magenta, 


and yellow can be mixed to produce red, green, blue, and black. 


to obtain new ones. You may recall that blue+yellow makes green, red+blue makes purple, 
and red+green makes brown. If you revisited this topic at a later age, you may have learned 
that the proper subtractive primaries are actually cyan (a light blue-green), magenta (pink), 
and yellow (Figure 2.28b), although black is also often used in four-color printing (CMYK).²²
If you ever subsequently took any painting classes, you learned that colors can have even 
more fanciful names, such as alizarin crimson, cerulean blue, and chartreuse. The subtractive 
colors are called subtractive because pigments in the paint absorb certain wavelengths in the 
color spectrum. 

Later on, you may have learned about the additive primary colors (red, green, and blue) 
and how they can be added (with a slide projector or on a computer monitor) to produce cyan, 
magenta, yellow, white, and all the other colors we typically see on our TV sets and monitors 
(Figure 2.28a).

Through what process is it possible for two different colors, such as red and green, to 
interact to produce a third color like yellow? Are the wavelengths somehow mixed up to 
produce a new wavelength? 

You probably know that the correct answer has nothing to do with physically mixing 
wavelengths. Instead, the existence of three primaries is a result of the tri-stimulus (or tri- 
chromatic) nature of the human visual system, since we have three different kinds of cells 
called cones, each of which responds selectively to a different portion of the color spec- 
trum (Glassner 1995; Wandell 1995; Wyszecki and Stiles 2000; Livingstone 2008; Frisby 
and Stone 2010; Reinhard, Heidrich et al. 2010; Fairchild 2013).²³ Note that for machine


22 It is possible to use additional inks such as orange, green, and violet to further extend the color gamut.
23 See also Mark Fairchild’s web page, http://markfairchild.org/WhyIsColor/books_links.html.

Figure 2.29 Standard CIE color matching functions: (a) $\bar{r}(\lambda)$, $\bar{g}(\lambda)$, $\bar{b}(\lambda)$ color spectra
obtained from matching pure colors to the R=700.0nm, G=546.1nm, and B=435.8nm primaries;
(b) $\bar{x}(\lambda)$, $\bar{y}(\lambda)$, $\bar{z}(\lambda)$ color matching functions, which are linear combinations of the
($\bar{r}(\lambda)$, $\bar{g}(\lambda)$, $\bar{b}(\lambda)$) spectra.


vision applications, such as remote sensing and terrain classification, it is preferable to use 
many more wavelengths. Similarly, surveillance applications can often benefit from sensing 


in the near-infrared (NIR) range. 


CIE RGB and XYZ 


To test and quantify the tri-chromatic theory of perception, we can attempt to reproduce all 
monochromatic (single wavelength) colors as a mixture of three suitably chosen primaries. 
(Pure wavelength light can be obtained using either a prism or specially manufactured color 
filters.) In the 1930s, the Commission Internationale d’Éclairage (CIE) standardized the RGB
representation by performing such color matching experiments using the primary colors of 
red (700.0nm wavelength), green (546.1nm), and blue (435.8nm). 

Figure 2.29 shows the results of performing these experiments with a standard observer, 
i.e., averaging perceptual results over a large number of subjects.24 You will notice that for
certain pure spectra in the blue-green range, a negative amount of red light has to be added, 
i.e., a certain amount of red has to be added to the color being matched to get a color match. 
These results also provided a simple explanation for the existence of metamers, which are 
colors with different spectra that are perceptually indistinguishable. Note that two fabrics or 
paint colors that are metamers under one light may no longer be so under different lighting. 


Because of the problem associated with mixing negative light, the CIE also developed a
new color space called XYZ, which contains all of the pure spectral colors within its positive
octant. (It also maps the Y axis to the luminance, i.e., perceived relative brightness, and maps
pure white to a diagonal (equal-valued) vector.) The transformation from RGB to XYZ is
given by

$$
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
= \frac{1}{0.17697}
\begin{bmatrix}
0.49 & 0.31 & 0.20 \\
0.17697 & 0.81240 & 0.01063 \\
0.00 & 0.01 & 0.99
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}. \tag{2.104}
$$

24 As Michael Brown notes in his tutorial on color (Brown 2019), the standard observer is actually an average
taken over only 17 British subjects in the 1920s.


While the official definition of the CIE XYZ standard has the matrix normalized so that the 
Y value corresponding to pure red is 1, a more commonly used form is to omit the leading 
fraction, so that the second row adds up to one, i.e., the RGB triplet (1, 1, 1) maps to a Y value 
of 1. Linearly blending the $(\bar{r}(\lambda), \bar{g}(\lambda), \bar{b}(\lambda))$ curves in Figure 2.29a according to (2.104), we
obtain the resulting $(\bar{x}(\lambda), \bar{y}(\lambda), \bar{z}(\lambda))$ curves shown in Figure 2.29b. Notice how all three
spectra (color matching functions) now have only positive values and how the $\bar{y}(\lambda)$ curve
matches that of the luminance perceived by humans. 
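As a small numerical check of (2.104), the following sketch applies the matrix to a few RGB triplets. It is only an illustration: the function and variable names (`cie_rgb_to_xyz`, `mat_vec`, `official`) are our own and do not come from any library, and the inputs are assumed to be linear CIE RGB values.

```python
# Matrix from Equation (2.104) without the leading 1/0.17697 factor,
# so that the RGB triplet (1, 1, 1) maps to Y = 1.
CIE_RGB_TO_XYZ = [
    [0.49,    0.31,    0.20],
    [0.17697, 0.81240, 0.01063],
    [0.00,    0.01,    0.99],
]


def mat_vec(m, v):
    """Multiply a 3x3 matrix (a list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def cie_rgb_to_xyz(rgb, official=False):
    """Map a linear CIE RGB triplet to XYZ.

    With official=True, the result is scaled by 1/0.17697, so that
    pure red (1, 0, 0) has Y = 1, as in the official CIE definition.
    """
    xyz = mat_vec(CIE_RGB_TO_XYZ, rgb)
    if official:
        xyz = [c / 0.17697 for c in xyz]
    return xyz


print(cie_rgb_to_xyz([1.0, 1.0, 1.0]))                 # white: X, Y, Z all close to 1
print(cie_rgb_to_xyz([1.0, 0.0, 0.0], official=True))  # pure red: Y equals 1
```

Because every row of the unnormalized matrix sums to one, white maps to an equal-valued XYZ vector, which is the diagonal property mentioned above.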

If we divide the XYZ values by the sum of X+Y+Z, we obtain the chromaticity coordinates

$$
x = \frac{X}{X+Y+Z}, \quad y = \frac{Y}{X+Y+Z}, \quad z = \frac{Z}{X+Y+Z}, \tag{2.105}
$$

which sum to 1. The chromaticity coordinates discard the absolute intensity of a given color
sample and just represent its pure color. If we sweep the monochromatic color $\lambda$ parameter in
Figure 2.29b from $\lambda = 380$nm to $\lambda = 800$nm, we obtain the familiar chromaticity diagram
shown in Figure 2.30a. This figure shows the (x, y) value for every color value perceivable 
by most humans. (Of course, the CMYK reproduction process in this book does not actually 
span the whole gamut of perceivable colors.) The outer curved rim represents where all of the 
pure monochromatic color values map in (x, y) space, while the lower straight line, which 
connects the two endpoints, is known as the purple line. The inset triangle spans the red, 
green, and blue single-wavelength primaries used in the original color matching experiments, 
while E denotes the white point. 

A convenient representation for color values, when we want to tease apart luminance 
and chromaticity, is therefore Yxy (luminance plus the two most distinctive chrominance
components).
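The conversion between XYZ and Yxy can be sketched in a few lines. The function names below are hypothetical, and raising an error for black (whose chromaticity is undefined) is just one possible design choice.

```python
def xyz_to_Yxy(X, Y, Z):
    """Split an XYZ triplet into luminance Y and chromaticity (x, y), as in (2.105)."""
    total = X + Y + Z
    if total == 0.0:
        raise ValueError("chromaticity is undefined for black (X = Y = Z = 0)")
    return Y, X / total, Y / total


def Yxy_to_xyz(Y, x, y):
    """Invert xyz_to_Yxy using z = 1 - x - y and X + Y + Z = Y / y."""
    if y == 0.0:
        raise ValueError("cannot recover XYZ when y = 0")
    X = x * Y / y
    Z = (1.0 - x - y) * Y / y
    return X, Y, Z


# Scaling the intensity changes Y but leaves (x, y) unchanged.
print(xyz_to_Yxy(0.3, 0.6, 0.1))
print(xyz_to_Yxy(0.6, 1.2, 0.2))
```

The equal-energy white point E, with X = Y = Z, lands at x = y = 1/3.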


L*a*b* color space 


While the XYZ color space has many convenient properties, including the ability to separate
luminance from chrominance, it does not actually predict how well humans perceive differences
in color or luminance.


Figure 2.30 CIE chromaticity diagram, showing the pure single-wavelength spectral colors
along the perimeter and the white point at E, plotted along their corresponding (x, y) values.
(a) the red, green, and blue primaries do not span the complete gamut, so that negative
amounts of red need to be added to span the blue-green range; (b) the MacAdam ellipses
show color regions of equal discriminability, and form the basis of the Lab perceptual color
space.


Because the response of the human visual system is roughly logarithmic (we can perceive
relative luminance differences of about 1%), the CIE defined a non-linear re-mapping of the
XYZ space called L*a*b* (also sometimes called CIELAB), where differences in luminance
or chrominance are more perceptually uniform, as shown in Figure 2.30b.25


The L* component of lightness is defined as

$$
L^* = 116 f\left(\frac{Y}{Y_n}\right) - 16, \tag{2.106}
$$

where $Y_n$ is the luminance value for nominal white (Fairchild 2013) and

$$
f(t) = \begin{cases} t^{1/3} & t > \delta^3 \\ t/(3\delta^2) + 2\delta/3 & \text{else,} \end{cases} \tag{2.107}
$$

is a finite-slope approximation to the cube root with $\delta = 6/29$. The resulting 0...100 scale
roughly measures equal amounts of lightness perceptibility; the offset of 16 makes black map
to zero, since $116 f(0) = 16$.

25 Another perceptually motivated color space called L*u*v* was developed and standardized simultaneously
(Fairchild 2013).

In a similar fashion, the a* and b* components are defined as

$$
a^* = 500 \left[ f\left(\frac{X}{X_n}\right) - f\left(\frac{Y}{Y_n}\right) \right] \quad \text{and} \quad b^* = 200 \left[ f\left(\frac{Y}{Y_n}\right) - f\left(\frac{Z}{Z_n}\right) \right], \tag{2.108}
$$

where again, $(X_n, Y_n, Z_n)$ is the measured white point. Figure 2.33i-k show the L*a*b*
representation for a sample color image.
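Equations (2.106) to (2.108) translate directly into code. The sketch below uses the standard CIELAB lightness offset of 16, so that L* spans 0 to 100, and assumes a D65-like white point obtained by summing the rows of the BT.709 matrix; both the default white point and the function names are our own choices for illustration.

```python
DELTA = 6.0 / 29.0


def lab_f(t):
    """Finite-slope approximation to the cube root, Equation (2.107)."""
    if t > DELTA ** 3:
        return t ** (1.0 / 3.0)
    return t / (3.0 * DELTA ** 2) + 2.0 * DELTA / 3.0


def xyz_to_lab(X, Y, Z, white=(0.950456, 1.0, 1.088754)):
    """Convert XYZ to L*a*b* relative to the white point (Xn, Yn, Zn)."""
    Xn, Yn, Zn = white
    fx, fy, fz = lab_f(X / Xn), lab_f(Y / Yn), lab_f(Z / Zn)
    L = 116.0 * fy - 16.0   # lightness, 0 (black) to 100 (white)
    a = 500.0 * (fx - fy)   # roughly green (negative) to red (positive)
    b = 200.0 * (fy - fz)   # roughly blue (negative) to yellow (positive)
    return L, a, b


print(xyz_to_lab(0.950456, 1.0, 1.088754))  # the white point itself
print(xyz_to_lab(0.2, 0.2, 0.2))            # a darker sample
```

In this sketch the white point maps to L* = 100 with a* = b* = 0, and black maps to L* = 0, because the linear segment of f gives f(0) = 4/29.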


Color cameras 


While the preceding discussion tells us how we can uniquely describe the perceived tri- 
stimulus description of any color (spectral distribution), it does not tell us how RGB still 
and video cameras actually work. Do they just measure the amount of light at the nominal 
wavelengths of red (700.0nm), green (546.1nm), and blue (435.8nm)? Do color monitors just
emit exactly these wavelengths and, if so, how can they emit negative red light to reproduce 
colors in the cyan range? 

In fact, the design of RGB video cameras has historically been based around the availabil- 
ity of colored phosphors that go into television sets. When standard-definition color television 
was invented (NTSC), a mapping was defined between the RGB values that would drive the 
three color guns in the cathode ray tube (CRT) and the XYZ values that unambiguously de- 
fine perceived color (this standard was called ITU-R BT.601). With the advent of HDTV and 
newer monitors, a new standard called ITU-R BT.709 was created, which specifies the XYZ 
values of each of the color primaries, 


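$$
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
=
\begin{bmatrix}
0.412453 & 0.357580 & 0.180423 \\
0.212671 & 0.715160 & 0.072169 \\
0.019334 & 0.119193 & 0.950227
\end{bmatrix}
\begin{bmatrix} R_{709} \\ G_{709} \\ B_{709} \end{bmatrix}. \tag{2.109}
$$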
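One way to read (2.109) is that each column of the matrix holds the XYZ value of one primary. The following sketch, with hypothetical helper names, computes the chromaticities of the primaries and of the white point (1, 1, 1) from the matrix alone.

```python
BT709_TO_XYZ = [
    [0.412453, 0.357580, 0.180423],
    [0.212671, 0.715160, 0.072169],
    [0.019334, 0.119193, 0.950227],
]


def bt709_to_xyz(rgb):
    """Apply Equation (2.109) to a linear BT.709 RGB triplet."""
    return [sum(BT709_TO_XYZ[i][j] * rgb[j] for j in range(3)) for i in range(3)]


def chromaticity(xyz):
    """Return the (x, y) chromaticity coordinates of an XYZ triplet."""
    total = sum(xyz)
    return xyz[0] / total, xyz[1] / total


for name, rgb in [("red", [1, 0, 0]), ("green", [0, 1, 0]),
                  ("blue", [0, 0, 1]), ("white", [1, 1, 1])]:
    x, y = chromaticity(bt709_to_xyz(rgb))
    print(f"{name:5s} x={x:.4f} y={y:.4f}")
```

The white triplet comes out near (0.3127, 0.3290), the chromaticity of the D65 daylight illuminant, and its Y value is 1 because the second row sums to one.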


In practice, each color camera integrates light according to the spectral response function
of its red, green, and blue sensors,

$$
\begin{aligned}
R &= \int L(\lambda) S_R(\lambda) \, d\lambda, \\
G &= \int L(\lambda) S_G(\lambda) \, d\lambda, \\
B &= \int L(\lambda) S_B(\lambda) \, d\lambda,
\end{aligned} \tag{2.110}
$$

where $L(\lambda)$ is the incoming spectrum of light at a given pixel and $\{S_R(\lambda), S_G(\lambda), S_B(\lambda)\}$
are the red, green, and blue spectral sensitivities of the corresponding sensors.

Can we tell what spectral sensitivities the cameras actually have? Unless the camera
manufacturer provides us with these data or we observe the response of the camera to a whole 
spectrum of monochromatic lights, these sensitivities are not specified by a standard such as 
BT.709. Instead, all that matters is that the tri-stimulus values for a given color produce the 
specified RGB values. The manufacturer is free to use sensors with sensitivities that do not 
match the standard XYZ definitions, so long as they can later be converted (through a linear 
transform) to the standard colors. 
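To make the integrals in (2.110) concrete, here is a minimal numerical sketch. The Gaussian sensitivity curves are purely hypothetical (they are not the response of any real camera), and the integral is approximated by a midpoint sum over 380nm to 780nm.

```python
import math


def gaussian(lam, center, width):
    """Unnormalized Gaussian bump centered at `center` (in nm)."""
    return math.exp(-0.5 * ((lam - center) / width) ** 2)


# Hypothetical sensor sensitivities, chosen only for illustration.
SENSITIVITIES = {
    "R": lambda lam: gaussian(lam, 600.0, 40.0),
    "G": lambda lam: gaussian(lam, 540.0, 40.0),
    "B": lambda lam: gaussian(lam, 460.0, 40.0),
}


def sensor_response(spectrum, sensitivity, lo=380.0, hi=780.0, step=1.0):
    """Approximate the integral of L(lambda) S(lambda) with a midpoint sum."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n):
        lam = lo + (k + 0.5) * step
        total += spectrum(lam) * sensitivity(lam) * step
    return total


def camera_rgb(spectrum):
    """Return the (R, G, B) responses to a spectrum L(lambda)."""
    return tuple(sensor_response(spectrum, SENSITIVITIES[c]) for c in "RGB")


print(camera_rgb(lambda lam: 1.0))                        # flat spectrum
print(camera_rgb(lambda lam: gaussian(lam, 600.0, 5.0)))  # narrow band near 600nm
```

A narrow-band light near 600nm excites mostly the red sensor but still produces a smaller green response, because the assumed sensitivities overlap.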

Similarly, while TV and computer monitors are supposed to produce RGB values as spec- 
ified by Equation (2.109), there is no reason that they cannot use digital logic to transform 
the incoming RGB values into different signals to drive each of the color channels.26 Properly
calibrated monitors make this information available to software applications that perform
color management, so that colors in real life, on the screen, and on the printer all match as
closely as possible.


Color filter arrays 


While early color TV cameras used three vidicons (tubes) to perform their sensing and later 
cameras used three separate RGB sensing chips, most of today’s digital still and video cam- 
eras use a color filter array (CFA), where alternating sensors are covered by different colored 
filters (Figure 2.24).27 

The most commonly used pattern in color cameras today is the Bayer pattern (Bayer 
1976), which places green filters over half of the sensors (in a checkerboard pattern), and red 
and blue filters over the remaining ones (Figure 2.31). The reason that there are twice as many 
green filters as red and blue is because the luminance signal is mostly determined by green 
values and the visual system is much more sensitive to high-frequency detail in luminance 
than in chrominance (a fact that is exploited in color image compression—see Section 2.3.3). 
The process of interpolating the missing color values so that we have valid RGB values for 
all the pixels is known as demosaicing and is covered in detail in Section 10.3.1. 
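The sampling performed by a Bayer color filter array can be simulated in a few lines. The sketch below assumes a GRBG layout (green in the top-left corner, red to its right, blue below it); real sensors may start the pattern at a different corner, and the function names are our own.

```python
def bayer_channel(row, col):
    """Return the filter color at (row, col) for an assumed GRBG layout."""
    if (row + col) % 2 == 0:
        return "G"                       # green checkerboard
    return "R" if row % 2 == 0 else "B"  # red on even rows, blue on odd rows


def mosaic(image):
    """Keep one color sample per pixel.

    `image` is a list of rows of (R, G, B) tuples; the result is a list of
    rows of single raw values, as a CFA sensor would record them.
    """
    index = {"R": 0, "G": 1, "B": 2}
    return [[pixel[index[bayer_channel(r, c)]] for c, pixel in enumerate(row)]
            for r, row in enumerate(image)]


flat = [[(10, 20, 30)] * 4 for _ in range(2)]
for raw_row in mosaic(flat):
    print(raw_row)
```

Counting the filters over any even-sized window confirms that half of the sensors are green and a quarter each are red and blue.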

Similarly, color LCD monitors typically use alternating stripes of red, green, and blue 
filters placed in front of each liquid crystal active area to simulate the experience of a full color 
display. As before, because the visual system has higher resolution (acuity) in luminance than 
chrominance, it is possible to digitally prefilter RGB (and monochrome) images to enhance
the perception of crispness (Betrisey, Blinn et al. 2000; Platt 2000b).

26 The latest OLED TV monitors are now introducing higher dynamic range (HDR) and wide color gamut (WCG),
https://www.cnet.com/how-to/what-is-wide-color-gamut-wcg.

27 A chip design by Foveon stacked the red, green, and blue sensors beneath each other, but it never gained
widespread adoption. Descriptions of alternative color filter arrays that have been proposed over the years can be
found at https://en.wikipedia.org/wiki/Color_filter_array.

Figure 2.31 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values,
with unknown (guessed) values shown as lower case.


Color balance 


Before encoding the sensed RGB values, most cameras perform some kind of color balancing 
operation in an attempt to move the white point of a given image closer to pure white (equal 
RGB values). If the color system and the illumination are the same (the BT.709 system uses 
the daylight illuminant $D_{65}$ as its reference white), the change may be minimal. However,
if the illuminant is strongly colored, such as incandescent indoor lighting (which generally 
results in a yellow or orange hue), the compensation can be quite significant. 

A simple way to perform color correction is to multiply each of the RGB values by a 
different factor (i.e., to apply a diagonal matrix transform to the RGB color space). More 
complicated transforms, which are sometimes the result of mapping to XYZ space and back, 
actually perform a color twist, i.e., they use a general 3 × 3 color transform matrix.28 Exercise 2.8
has you explore some of these issues.
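The diagonal form of color balancing can be sketched as follows. We assume here that the RGB value of a region known to be white has already been measured, and we normalize the gains so that green is left unchanged; both choices, like the function names, are illustrative rather than a description of what any particular camera does.

```python
def white_balance_gains(white_rgb):
    """Per-channel gains that map the observed white to equal RGB values.

    Green is kept fixed (gain 1), which is a common but arbitrary convention.
    """
    r, g, b = white_rgb
    return (g / r, 1.0, g / b)


def apply_gains(rgb, gains):
    """Apply the diagonal color transform to one linear RGB triplet."""
    return tuple(c * k for c, k in zip(rgb, gains))


# A white patch seen under warm (incandescent-like) lighting.
observed_white = (200.0, 160.0, 100.0)
gains = white_balance_gains(observed_white)
print(gains)
print(apply_gains(observed_white, gains))
```

A general color twist would replace the three gains with a full 3 × 3 matrix multiplication.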


Gamma 


In the early days of black and white television, the phosphors in the CRT used to display 
the TV signal responded non-linearly to their input voltage. The relationship between the 
voltage and the resulting brightness was characterized by a number called gamma ($\gamma$), since
the formula was roughly

$$
B = V^{\gamma}, \tag{2.111}
$$

with a $\gamma$ of about 2.2. To compensate for this effect, the electronics in the TV camera would
pre-map the sensed luminance Y through an inverse gamma,

$$
Y' = Y^{1/\gamma}, \tag{2.112}
$$

with a typical value of $1/\gamma = 0.45$.

28 Those of you old enough to remember the early days of color television will naturally think of the hue adjustment
knob on the television set, which could produce truly bizarre results.

Figure 2.32 Gamma compression: (a) The relationship between the input signal luminance
Y and the transmitted signal Y’ is given by $Y' = Y^{1/\gamma}$. (b) At the receiver, the signal Y’ is
exponentiated by the factor $\gamma$, $\hat{Y} = {Y'}^{\gamma}$. Noise introduced during transmission is squashed
in the dark regions, which corresponds to the more noise-sensitive region of the visual system.

The mapping of the signal through this non-linearity before transmission had a beneficial 
side effect: noise added during transmission (remember, these were analog days!) would be 
reduced (after applying the gamma at the receiver) in the darker regions of the signal where 
it was more visible (Figure 2.32).29 (Remember that our visual system is roughly sensitive to
relative differences in luminance.) 

When color television was invented, it was decided to separately pass the red, green, and 
blue signals through the same gamma non-linearity before combining them for encoding. 
Today, even though we no longer have analog noise in our transmission systems, signals are 
still quantized during compression (see Section 2.3.3), so applying inverse gamma to sensed 
values remains useful. 
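The effect of gamma on transmission noise can be checked numerically. This sketch uses a pure power law with gamma = 2.2 on luminances in [0, 1]; real standards add a short linear segment near black, which is ignored here, and the function names are hypothetical.

```python
GAMMA = 2.2  # typical CRT gamma quoted in the text


def gamma_encode(y, gamma=GAMMA):
    """Camera side: Y' = Y^(1/gamma), Equation (2.112), for Y in [0, 1]."""
    return y ** (1.0 / gamma)


def gamma_decode(y_prime, gamma=GAMMA):
    """Display side: Y = Y'^gamma, Equation (2.111), for Y' in [0, 1]."""
    return y_prime ** gamma


def received_error(y, noise, gamma=GAMMA):
    """Luminance error after adding `noise` to the transmitted signal Y'."""
    return gamma_decode(gamma_encode(y, gamma) + noise, gamma) - y


# The same amount of noise added to Y' yields a smaller luminance error
# in a dark region than in a bright one.
print(received_error(0.01, 0.01))
print(received_error(0.80, 0.01))
```

With the same additive noise on the transmitted signal, the decoded error is roughly ten times smaller at Y = 0.01 than at Y = 0.8, which is the squashing illustrated in Figure 2.32.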

Unfortunately, for both computer vision and computer graphics, the presence of gamma 
in images is often problematic. For example, the proper simulation of radiometric phenomena 
such as shading (see Section 2.2 and Equation (2.88)) occurs in a linear radiance space. Once 
all of the computations have been performed, the appropriate gamma should be applied before
display. Unfortunately, many computer graphics systems (such as shading models) operate
directly on RGB values and display these values directly. (Fortunately, newer color imaging
standards such as the 16-bit scRGB use a linear space, which makes this less of a problem
(Glassner 1995).)

29 A related technique called companding was the basis of the Dolby noise reduction systems used with audio
tapes.

In computer vision, the situation can be even more daunting. The accurate determination 
of surface normals, using a technique such as photometric stereo (Section 13.1.1) or even a 
simpler operation such as accurate image deblurring, requires that the measurements be in a
linear space of intensities. Therefore, it is imperative when performing detailed quantitative 
computations such as these to first undo the gamma and the per-image color re-balancing 
in the sensed color values. Chakrabarti, Scharstein, and Zickler (2009) develop a sophisti- 
cated 24-parameter model that is a good match to the processing performed by today’s digital 
cameras; they also provide a database of color images you can use for your own testing. 

For other vision applications, however, such as feature detection or the matching of sig- 
nals in stereo and motion estimation, this linearization step is often not necessary. In fact, 
determining whether it is necessary to undo gamma can take some careful thinking, e.g., in 
the case of compensating for exposure variations in image stitching (see Exercise 2.7). 

If all of these processing steps sound confusing to model, they are. Exercise 2.9 has you 
try to tease apart some of these phenomena using empirical investigation, i.e., taking pictures 
of color charts and comparing the RAW and JPEG compressed color values. 


Other color spaces 


While RGB and XYZ are the primary color spaces used to describe the spectral content (and 
hence tri-stimulus response) of color signals, a variety of other representations have been 
developed both in video and still image coding and in computer graphics. 

The earliest color representation developed for video transmission was the YIQ standard 
developed for NTSC video in North America and the closely related YUV standard developed 
for PAL in Europe. In both of these cases, it was desired to have a luma channel Y (so called 
since it only roughly mimics true luminance) that would be comparable to the regular black- 
and-white TV signal, along with two lower frequency chroma channels. 

In both systems, the Y signal (or more appropriately, the Y’ luma signal since it is gamma
compressed) is obtained from

$$
Y'_{601} = 0.299 R' + 0.587 G' + 0.114 B', \tag{2.113}
$$

where R’G’B’ is the triplet of gamma-compressed color components. When using the newer
color definitions for HDTV in BT.709, the formula is

$$
Y'_{709} = 0.2125 R' + 0.7154 G' + 0.0721 B'. \tag{2.114}
$$

The UV components are derived from scaled versions of (B’−Y’) and (R’−Y’), namely,

$$
U = 0.492111 (B' - Y') \quad \text{and} \quad V = 0.877283 (R' - Y'), \tag{2.115}
$$


whereas the IQ components are the UV components rotated through an angle of 33°. In 
composite (NTSC and PAL) video, the chroma signals were then low-pass filtered horizon- 
tally before being modulated and superimposed on top of the Y’ luma signal. Backward 
compatibility was achieved by having older black-and-white TV sets effectively ignore the 
high-frequency chroma signal (because of slow electronics) or, at worst, superimposing it as 
a high-frequency pattern on top of the main signal. 

While these conversions were important in the early days of computer vision, when frame 
grabbers would directly digitize the composite TV signal, today all digital video and still 
image compression standards are based on the newer YCbCr conversion. YCbCr is closely 
related to YUV (the Cb and Cr signals carry the blue and red color difference signals and have
more useful mnemonics than UV) but uses different scale factors to fit within the eight-bit 
range available with digital signals. 

For video, the Y’ signal is re-scaled to fit within the [16...235] range of values, while
the Cb and Cr signals are scaled to fit within [16...240] (Gomes and Velho 1997; Fairchild
2013). For still images, the JPEG standard uses the full eight-bit range with no reserved
values,

$$
\begin{bmatrix} Y' \\ C_b \\ C_r \end{bmatrix}
=
\begin{bmatrix}
0.299 & 0.587 & 0.114 \\
-0.168736 & -0.331264 & 0.5 \\
0.5 & -0.418688 & -0.081312
\end{bmatrix}
\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix}
+
\begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix}, \tag{2.116}
$$

where the R’G’B’ values are the eight-bit gamma-compressed color components (i.e., the
actual RGB values we obtain when we open up or display a JPEG image). For most appli- 
cations, this formula is not that important, since your image reading software will directly 
provide you with the eight-bit gamma-compressed R’G’B’ values. However, if you are trying 
to do careful image deblocking (Exercise 4.3), this information may be useful. 
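For readers who do need the conversion, (2.116) and its inverse can be sketched as below. The forward coefficients are those of (2.116); the inverse coefficients are the rounded values commonly used with JFIF files, and rounding and clamping to [0, 255] are left out to keep the example short.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range JPEG conversion of eight-bit R'G'B' values, Equation (2.116)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Approximate inverse of rgb_to_ycbcr (coefficients rounded to six digits)."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b


print(rgb_to_ycbcr(128, 128, 128))                 # neutral gray has Cb = Cr = 128
print(ycbcr_to_rgb(*rgb_to_ycbcr(200, 50, 100)))   # round trip, up to rounding
```

Any gray value maps to Cb = Cr = 128, since the second and third rows of the matrix each sum to zero.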

Another color space you may come across is hue, saturation, value (HSV), which is a 
projection of the RGB color cube onto a non-linear chroma angle, a radial saturation per- 
centage, and a luminance-inspired value. In more detail, value is defined as either the mean 
or maximum color value, saturation is defined as scaled distance from the diagonal, and hue 
is defined as the direction around a color wheel (the exact formulas are described by Hall 
(1989), Hughes, van Dam et al. (2013), and Brown (2019)). Such a decomposition is quite 
natural in graphics applications such as color picking (it approximates the Munsell chart for
color description). Figure 2.33l-n shows an HSV representation of a sample color image,
where saturation is encoded using a gray scale (saturated = darker) and hue is depicted as a
color.
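Python's standard `colorsys` module implements one common variant of HSV, in which value is the maximum of the three color components. The small wrappers below (`hsv` and `scale_value` are our own names) show how to change the value while leaving hue and saturation untouched, assuming RGB components in [0, 1].

```python
import colorsys


def hsv(rgb):
    """Return (hue, saturation, value), each in [0, 1], for an RGB triplet in [0, 1]."""
    return colorsys.rgb_to_hsv(*rgb)


def scale_value(rgb, factor):
    """Scale only the V component of an RGB triplet, clamping V to 1."""
    h, s, v = hsv(rgb)
    return colorsys.hsv_to_rgb(h, s, min(1.0, v * factor))


print(hsv((1.0, 0.0, 0.0)))               # pure red: hue 0, fully saturated
print(hsv((0.5, 0.5, 0.5)))               # gray: zero saturation
print(scale_value((0.2, 0.4, 0.1), 2.0))  # brighter, same hue and saturation
```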

If you want your computer vision algorithm to only affect the value (luminance) of an
image and not its saturation or hue, a simpler solution is to use either the Yxy (luminance +
chromaticity) coordinates defined in (2.105) or the even simpler color ratios,

$$
r = \frac{R}{R+G+B}, \quad g = \frac{G}{R+G+B}, \quad b = \frac{B}{R+G+B} \tag{2.117}
$$

(Figure 2.33e-h). After manipulating the luma (2.113), e.g., through the process of histogram
equalization (Section 3.1.4), you can multiply each color ratio by the ratio of the new to old
luma to obtain an adjusted RGB triplet.
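A minimal sketch of this luma adjustment is given below. It scales all three channels by the ratio of new to old luma, which leaves the color ratios of (2.117) unchanged; the handling of black pixels (returning a gray value) is our own assumption, as are the function names.

```python
def luma601(r, g, b):
    """Luma from Equation (2.113)."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def color_ratios(r, g, b):
    """Color ratios from Equation (2.117); undefined for black."""
    total = r + g + b
    return r / total, g / total, b / total


def set_luma(rgb, new_luma):
    """Return an RGB triplet with the requested luma and the same color ratios."""
    old_luma = luma601(*rgb)
    if old_luma == 0.0:
        return (new_luma, new_luma, new_luma)  # black has no hue; fall back to gray
    scale = new_luma / old_luma
    return tuple(c * scale for c in rgb)


pixel = (50.0, 100.0, 150.0)
brighter = set_luma(pixel, 200.0)
print(brighter, luma601(*brighter))
print(color_ratios(*pixel), color_ratios(*brighter))
```

In practice the result may need clamping, since a large luma increase can push a saturated channel above the representable range.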

While all of these color systems may sound confusing, in the end, it often may not mat- 
ter that much which one you use. Poynton, in his Color FAQ, https://www.poynton.com/ 
ColorFAQ.html, notes that the perceptually motivated L*a*b* system is qualitatively similar 
to the gamma-compressed R’G’B’ system we mostly deal with, since both have a fractional 
power scaling (which approximates a logarithmic response) between the actual intensity values
and the numbers being manipulated. As in all cases, think carefully about what you are
trying to accomplish before deciding on a technique to use.


2.3.3 Compression 


The last stage in a camera’s processing pipeline is usually some form of image compression 
(unless you are using a lossless compression scheme such as camera RAW or PNG). 

All color video and image compression algorithms start by converting the signal into 
YCbCr (or some closely related variant), so that they can compress the luminance signal with 
higher fidelity than the chrominance signal. (Recall that the human visual system has poorer 
frequency response to color than to luminance changes.) In video, it is common to subsam- 
ple Cb and Cr by a factor of two horizontally; with still images (JPEG), the subsampling 
(averaging) occurs both horizontally and vertically. 
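Chroma subsampling by averaging can be sketched as follows for a single Cb or Cr channel stored as a list of rows. The sketch assumes even image dimensions and uses plain box averaging; actual codecs may use different filters and sample positions.

```python
def subsample_horizontal(channel):
    """Halve the width by averaging horizontal pairs (as is common in video)."""
    return [[(row[c] + row[c + 1]) / 2.0 for c in range(0, len(row), 2)]
            for row in channel]


def subsample_2x2(channel):
    """Halve both dimensions by averaging 2x2 blocks (as in typical JPEG files)."""
    height, width = len(channel), len(channel[0])
    return [[(channel[r][c] + channel[r][c + 1] +
              channel[r + 1][c] + channel[r + 1][c + 1]) / 4.0
             for c in range(0, width, 2)]
            for r in range(0, height, 2)]


cb = [[1, 3, 5, 7],
      [5, 7, 9, 11]]
print(subsample_horizontal(cb))
print(subsample_2x2(cb))
```

After 2 × 2 subsampling, the two chroma channels together hold only half as many samples as the luma channel.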

Once the luminance and chrominance images have been appropriately subsampled and 
separated into individual images, they are then passed to a block transform stage. The most 
common technique used here is the discrete cosine transform (DCT), which is a real-valued 
variant of the discrete Fourier transform (DFT) (see Section 3.4.1). The DCT is a reasonable 
approximation to the Karhunen—Loéve or eigenvalue decomposition of natural image patches, 
i.e., the decomposition that simultaneously packs the most energy into the first coefficients 
and diagonalizes the joint covariance matrix among the pixels (makes transform coefficients 
statistically independent). Both MPEG and JPEG use 8 × 8 DCT transforms (Wallace 1991;
Le Gall 1991), although newer variants, including the new AV1 open standard, use smaller
4 × 4 or even 2 × 2 blocks. Alternative transformations, such as wavelets (Taubman and
Marcellin 2002) and lapped transforms (Malvar 1990, 1998, 2000) are used in compression
standards such as JPEG 2000 and JPEG XR.

Figure 2.33 Color space transformations: (a-d) RGB; (e-h) rgb; (i-k) L*a*b*; (l-n) HSV.
Note that the rgb, L*a*b*, and HSV values are all re-scaled to fit the dynamic range of the
printed page.

Figure 2.34 Image compressed with JPEG at three quality settings. Note how the amount
of block artifact and high-frequency aliasing (“mosquito noise”) increases from left to right.

After transform coding, the coefficient values are quantized into a set of small integer 
values that can be coded using a variable bit length scheme such as a Huffman code or an 
arithmetic code (Wallace 1991; Marpe, Schwarz, and Wiegand 2003). (The DC (lowest fre- 
quency) coefficients are also adaptively predicted from the previous block’s DC values. The 
term “DC” comes from “direct current”, i.e., the non-sinusoidal or non-alternating part of a
signal.) The step size in the quantization is the main variable controlled by the quality setting 
on the JPEG file (Figure 2.34). 
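The transform and quantization stages can be illustrated with a direct (slow) implementation of the orthonormal 8 × 8 DCT. This is a simplified sketch: it uses a single quantization step for all coefficients, whereas JPEG uses a table with a different step per frequency, and it omits the level shift and entropy coding. All names are our own.

```python
import math

N = 8  # block size


def dct_1d(x):
    """Orthonormal DCT-II of a length-N sequence."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N)
                               for n in range(N)))
    return out


def dct_2d(block):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    # cols[c][r] is vertical frequency r at horizontal frequency c; transpose back.
    return [[cols[c][r] for c in range(N)] for r in range(N)]


def quantize(coeffs, step):
    """Uniform quantization to small integers (larger step = lower quality)."""
    return [[int(round(v / step)) for v in row] for row in coeffs]


flat_block = [[100] * N for _ in range(N)]
q = quantize(dct_2d(flat_block), step=16)
print(q[0])  # only the DC coefficient is non-zero for a constant block
```

For a constant block all of the energy ends up in the DC coefficient, so every other quantized value is zero and codes very compactly.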

With video, it is also usual to perform block-based motion compensation, i.e., to encode 
the difference between each block and a predicted set of pixel values obtained from a shifted 
block in the previous frame. (The exception is the motion-JPEG scheme used in older DV 
camcorders, which is nothing more than a series of individually JPEG compressed image 
frames.) While basic MPEG uses 16 x 16 motion compensation blocks with integer motion 
values (Le Gall 1991), newer standards use adaptively sized blocks, sub-pixel motions, and 
the ability to reference blocks from older frames (Sullivan, Ohm et al. 2012). In order to 
recover more gracefully from failures and to allow for random access to the video stream, 
predicted P frames are interleaved among independently coded I frames. (Bi-directional B 
frames are also sometimes used.) 

30 https://aomedia.org

The quality of a compression algorithm is usually reported using its peak signal-to-noise
ratio (PSNR), which is derived from the average mean square error,

$$
MSE = \frac{1}{n} \sum_{\mathbf{x}} \left[ I(\mathbf{x}) - \hat{I}(\mathbf{x}) \right]^2, \tag{2.118}
$$

where $n$ is the number of pixels, $I(\mathbf{x})$ is the original uncompressed image, and $\hat{I}(\mathbf{x})$ is its
compressed counterpart, or equivalently, the root mean square error (RMS error), which is defined as

$$
RMS = \sqrt{MSE}. \tag{2.119}
$$

The PSNR is defined as

$$
PSNR = 10 \log_{10} \frac{I_{\max}^2}{MSE} = 20 \log_{10} \frac{I_{\max}}{RMS}, \tag{2.120}
$$

where $I_{\max}$ is the maximum signal extent, e.g., 255 for eight-bit images.
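These definitions are straightforward to compute. The sketch below assumes two grayscale images of equal size stored as lists of rows; returning infinity for identical images is a convention we chose, since the PSNR is undefined when the MSE is zero.

```python
import math


def mse(original, compressed):
    """Mean square error between two equally sized images, Equation (2.118)."""
    total, n = 0.0, 0
    for row_a, row_b in zip(original, compressed):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            n += 1
    return total / n


def psnr(original, compressed, i_max=255.0):
    """Peak signal-to-noise ratio in decibels, Equation (2.120)."""
    err = mse(original, compressed)
    if err == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(i_max ** 2 / err)


original = [[0, 255], [100, 200]]
compressed = [[1, 254], [101, 199]]   # every pixel is off by one gray level
print(mse(original, compressed))
print(psnr(original, compressed))
```

An error of one gray level at every pixel gives an MSE of 1 and hence a PSNR of 20 log10(255), or about 48.13 dB.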

While this is just a high-level sketch of how image compression works, it is useful to 
understand so that the artifacts introduced by such techniques can be compensated for in 
various computer vision applications. Note also that researchers are currently developing 
novel image and video compression algorithms based on deep neural networks, e.g., (Rippel 
and Bourdev 2017; Mentzer, Agustsson et al. 2019; Rippel, Nair et al. 2019) and
https://www.compression.cc. It will be interesting to see what kinds of different artifacts these techniques
produce.


2.4 Additional reading 


As we mentioned at the beginning of this chapter, this book provides but a brief summary of 
a very rich and deep set of topics, traditionally covered in a number of separate fields. 

A more thorough introduction to the geometry of points, lines, planes, and projections 
can be found in textbooks on multi-view geometry (Faugeras and Luong 2001; Hartley and 
Zisserman 2004) and computer graphics (Watt 1995; OpenGL-ARB 1997; Hughes, van Dam 
et al. 2013; Marschner and Shirley 2015). Topics covered in more depth include higher- 
order primitives such as quadrics, conics, and cubics, as well as three-view and multi-view 
geometry. 

The image formation (synthesis) process is traditionally taught as part of a computer 
graphics curriculum (Glassner 1995; Watt 1995; Hughes, van Dam et al. 2013; Marschner 
and Shirley 2015) but it is also studied in physics-based computer vision (Wolff, Shafer, and 
Healey 1992a). The behavior of camera lens systems is studied in optics (Möller 1988; Ray 
2002; Hecht 2015). 

Some good books on color theory have been written by Healey and Shafer (1992), Wandell
(1995), Wyszecki and Stiles (2000), and Fairchild (2013), with Livingstone (2008) providing
a more fun and informal introduction to the topic of color perception. Mark Fairchild’s
page of color books and links31 lists many other sources.

31 http://markfairchild.org/WhyIsColor/books_links.html.

Topics relating to sampling and aliasing are covered in textbooks on signal and image
processing (Crane 1997; Jähne 1997; Oppenheim and Schafer 1996; Oppenheim, Schafer,
and Buck 1999; Pratt 2007; Russ 2007; Burger and Burge 2008; Gonzalez and Woods 2017).

Two courses that cover many of the above topics (image formation, lenses, color and sam- 
pling theory) in wonderful detail are Marc Levoy’s Digital Photography course at Stanford 
(Levoy 2010) and Michael Brown's tutorial on the image processing pipeline at ICCV 2019 
(Brown 2019). The recent book by Ikeuchi, Matsushita et al. (2020) also covers 3D geometry,
photometry, and sensor models, but with an emphasis on active illumination systems.


2.5 Exercises 


A note to students: This chapter is relatively light on exercises since it contains mostly 
background material and not that many usable techniques. If you really want to understand 
multi-view geometry in a thorough way, I encourage you to read and do the exercises provided 
by Hartley and Zisserman (2004). Similarly, if you want some exercises related to the image
formation process, Glassner’s (1995) book is full of challenging problems.


Ex 2.1: Least squares intersection point and line fitting—advanced. Equation (2.4) shows 
how the intersection of two 2D lines can be expressed as their cross product, assuming the
lines are expressed as homogeneous coordinates.

1. If you are given more than two lines and want to find a point $\tilde{\mathbf{x}}$ that minimizes the sum
of squared distances to each line,

$$
D = \sum_i (\tilde{\mathbf{x}} \cdot \tilde{\mathbf{l}}_i)^2, \tag{2.121}
$$

how can you compute this quantity? (Hint: Write the dot product as $\tilde{\mathbf{x}}^T \tilde{\mathbf{l}}_i$ and turn the
squared quantity into a quadratic form, $\tilde{\mathbf{x}}^T \mathbf{A} \tilde{\mathbf{x}}$.)

2. To fit a line to a bunch of points, you can compute the centroid (mean) of the points
as well as the covariance matrix of the points around this mean. Show that the line
passing through the centroid along the major axis of the covariance ellipsoid (largest
eigenvector) minimizes the sum of squared distances to the points.

3. These two approaches are fundamentally different, even though projective duality tells
us that points and lines are interchangeable. Why are these two algorithms so apparently
different? Are they actually minimizing different objectives?




Ex 2.2: 2D transform editor. Write a program that lets you interactively create a set of
rectangles and then modify their “pose” (2D transform). You should implement the following
steps:

1. Open an empty window (“canvas”).

2. Shift drag (rubber-band) to create a new rectangle.

3. Select the deformation mode (motion model): translation, rigid, similarity, affine, or
perspective.

4. Drag any corner of the outline to change its transformation.

This exercise should be built on a set of pixel coordinate and transformation classes, either
implemented by yourself or from a software library. Persistence of the created representation
(save and load) should also be supported (for each rectangle, save its transformation).


Ex 2.3: 3D viewer. Write a simple viewer for 3D points, lines, and polygons. Import a set 
of point and line commands (primitives) as well as a viewing transform. Interactively modify 
the object or camera transform. This viewer can be an extension of the one you created in 
Exercise 2.2. Simply replace the viewing transformations with their 3D equivalents. 
(Optional) Add a z-buffer to do hidden surface removal for polygons. 
(Optional) Use a 3D drawing package and just write the viewer control. 


Ex 2.4: Focus distance and depth of field. Figure out how the focus distance and depth of 
field indicators on a lens are determined. 


1. Compute and plot the focus distance $z_o$ as a function of the distance traveled from the
focal length $\Delta z_i = f - z_i$ for a lens of focal length $f$ (say, 100mm). Does this explain
the hyperbolic progression of focus distances you see on a typical lens (Figure 2.20)?

2. Compute the depth of field (minimum and maximum focus distances) for a given focus
setting $z_o$ as a function of the circle of confusion diameter $c$ (make it a fraction of
the sensor width), the focal length $f$, and the f-stop number $N$ (which relates to the
aperture diameter $d$). Does this explain the usual depth of field markings on a lens that
bracket the in-focus marker, as in Figure 2.20a?

3. Now consider a zoom lens with a varying focal length $f$. Assume that as you zoom,
the lens stays in focus, i.e., the distance from the rear nodal point to the sensor plane
$z_i$ adjusts itself automatically for a fixed focus distance $z_o$. How do the depth of field
indicators vary as a function of focal length? Can you reproduce a two-dimensional
plot that mimics the curved depth of field lines seen on the lens in Figure 2.20b?

Ex 2.5: F-numbers and shutter speeds. List the common f-numbers and shutter speeds
that your camera provides. On older model SLRs, they are visible on the lens and shutter
speed dials. On newer cameras, you have to look at the electronic viewfinder (or LCD
screen/indicator) as you manually adjust exposures.


1. Do these form geometric progressions; if so, what are the ratios? How do these relate
to exposure values (EVs)?

2. If your camera has shutter speeds of 1/60 and 1/125, do you think that these two speeds are
exactly a factor of two apart or a factor of 125/60 = 2.083 apart?

3. How accurate do you think these numbers are? Can you devise some way to measure
exactly how the aperture affects how much light reaches the sensor and what the exact
exposure times actually are?


Ex 2.6: Noise level calibration. Estimate the amount of noise in your camera by taking
repeated shots of a scene with the camera mounted on a tripod. (Purchasing a remote shutter
release is a good investment if you own a DSLR.) Alternatively, take a scene with constant
color regions (such as a color checker chart) and estimate the variance by fitting a smooth
function to each color region and then taking differences from the predicted function.

1. Plot your estimated variance as a function of level for each of your color channels
separately.

2. Change the ISO setting on your camera; if you cannot do that, reduce the overall light
in your scene (turn off lights, draw the curtains, wait until dusk). Does the amount of
noise vary a lot with ISO/gain?

3. Compare your camera to another one at a different price point or year of make. Is
there evidence to suggest that “you get what you pay for”? Does the quality of digital
cameras seem to be improving over time?


Ex 2.7: Gamma correction in image stitching. Here's a relatively simple puzzle. Assume
you are given two images that are part of a panorama that you want to stitch (see Section 8.2).
The two images were taken with different exposures, so you want to adjust the RGB values
so that they match along the seam line. Is it necessary to undo the gamma in the color values
in order to achieve this?


Ex 2.8: White point balancing—tricky. A common (in-camera or post-processing) tech- 
nique for performing white point adjustment is to take a picture of a white piece of paper and 


to adjust the RGB values of an image to make this a neutral color. 




1. Describe how you would adjust the RGB values in an image given a sample “white
color” of $(R_w, G_w, B_w)$ to make this color neutral (without changing the exposure too
much).

2. Does your transformation involve a simple (per-channel) scaling of the RGB values or
do you need a full 3 × 3 color twist matrix (or something else)?

3. Convert your RGB values to XYZ. Does the appropriate correction now only depend
on the XY (or xy) values? If so, when you convert back to RGB space, do you need a
full 3 × 3 color twist matrix to achieve the same effect?

4. If you used pure diagonal scaling in the direct RGB mode but end up with a twist if you
work in XYZ space, how do you explain this apparent dichotomy? Which approach is
correct? (Or is it possible that neither approach is actually correct?)


If you want to find out what your camera actually does, continue on to the next exercise. 


Ex 2.9: In-camera color processing—challenging. If your camera supports a RAW pixel 
mode, take a pair of RAW and JPEG images, and see if you can infer what the camera is doing 
when it converts the RAW pixel values to the final color-corrected and gamma-compressed 
eight-bit JPEG pixel values. 


1. Deduce the pattern in your color filter array from the correspondence between co-located
RAW and color-mapped pixel values. Use a color checker chart at this stage
if it makes your life easier. You may find it helpful to split the RAW image into four
separate images (subsampling even and odd columns and rows) and to treat each of
these new images as a “virtual” sensor.

2. Evaluate the quality of the demosaicing algorithm by taking pictures of challenging
scenes which contain strong color edges (such as those shown in Section 10.3.1).

3. If you can take the same exact picture after changing the color balance values in your
camera, compare how these settings affect this processing.

4. Compare your results against those presented in Chakrabarti, Scharstein, and Zickler
(2009), Kim, Lin et al. (2012), Hasinoff, Sharlet et al. (2016), Karaimer and Brown
(2016), and Brooks, Mildenhall et al. (2019) or use the data available in their database
of color images.
Figure 3.1 Some common image processing operations: (a) partial histogram equalization;
(b) orientation map computed from the second-order steerable filter (Freeman 1992) ©
1992 IEEE; (c) bilateral filter (Durand and Dorsey 2002) © 2002 ACM; (d) image pyramid;
(e) Laplacian pyramid blending (Burt and Adelson 1983b) © 1983 ACM; (f) line-based image
warping (Beier and Neely 1992) © 1992 ACM.

Now that we have seen how images are formed through the interaction of 3D scene elements,
lighting, and camera optics and sensors, let us look at the first stage in most computer vision 
algorithms, namely the use of image processing to preprocess the image and convert it into a 
form suitable for further analysis. Examples of such operations include exposure correction 
and color balancing, reducing image noise, increasing sharpness, or straightening the image 
by rotating it. Additional examples include image warping and image blending, which are 
often used for visual effects (Figure 3.1 and Section 3.6.3). While some may consider image
processing to be outside the purview of computer vision, most computer vision applications, 
such as computational photography and even recognition, require care in designing the image 
processing stages to achieve acceptable results. 

In this chapter, we review standard image processing operators that map pixel values from 
one image to another. Image processing is often taught in electrical engineering departments 
as a follow-on course to an introductory course in signal processing (Oppenheim and Schafer 
1996; Oppenheim, Schafer, and Buck 1999). There are several popular textbooks for image 
processing, including Gomes and Velho (1997), Jähne (1997), Pratt (2007), Burger and Burge
(2009), and Gonzalez and Woods (2017). 

We begin this chapter with the simplest kind of image transforms, namely those that 
manipulate each pixel independently of its neighbors (Section 3.1). Such transforms are of- 
ten called point operators or point processes. Next, we examine neighborhood (area-based) 
operators, where each new pixel's value depends on a small number of neighboring input 
values (Sections 3.2 and 3.3). A convenient tool to analyze (and sometimes accelerate) such 
neighborhood operations is the Fourier Transform, which we cover in Section 3.4. Neighbor- 
hood operators can be cascaded to form image pyramids and wavelets, which are useful for 
analyzing images at a variety of resolutions (scales) and for accelerating certain operations 
(Section 3.5). Another important class of global operators are geometric transformations, 
such as rotations, shears, and perspective deformations (Section 3.6). 

While this chapter covers classical image processing techniques that consist mostly of 
linear and non-linear filtering operations, the next two chapters introduce energy-based and 
Bayesian graphical models, i.e., Markov random fields (Chapter 4), and then deep convolutional
networks (Chapter 5), both of which are now widely used in image processing applications.


3.1 Point operators 


Figure 3.2 Some local image processing operations: (a) original image along with its
three color (per-channel) histograms; (b) brightness increased (additive offset, b = 16); (c)
contrast increased (multiplicative gain, a = 1.1); (d) gamma (partially) linearized (γ = 1.2);
(e) full histogram equalization; (f) partial histogram equalization.

Figure 3.3 Visualizing image data: (a) original image; (b) cropped portion and scanline
plot using an image inspection tool; (c) grid of numbers; (d) surface plot. For figures (c)-(d),
the image was first converted to grayscale.

The simplest kinds of image processing transforms are point operators, where each output
pixel's value depends on only the corresponding input pixel value (plus, potentially, some
globally collected information or parameters). Examples of such operators include brightness
and contrast adjustments (Figure 3.2) as well as color correction and transformations. In the
image processing literature, such operations are also known as point processes (Crane 1997).¹
We begin this section with a quick review of simple point operators, such as brightness 
scaling and image addition. Next, we discuss how colors in images can be manipulated. 
We then present image compositing and matting operations, which play an important role 
in computational photography (Chapter 10) and computer graphics applications. Finally, we 
describe the more global process of histogram equalization. We close with an example appli- 
cation that manipulates tonal values (exposure and contrast) to improve image appearance. 


3.1.1 Pixel transforms 


A general image processing operator is a function that takes one or more input images and 
produces an output image. In the continuous domain, this can be denoted as 


$$g(\mathbf{x}) = h(f(\mathbf{x})) \quad \text{or} \quad g(\mathbf{x}) = h(f_0(\mathbf{x}), \ldots, f_n(\mathbf{x})), \tag{3.1}$$

where x is in the D-dimensional (usually D = 2 for images) domain of the input and output
functions f and g, which operate over some range, which can either be scalar or vector-valued,
e.g., for color images or 2D motion. For discrete (sampled) images, the domain
consists of a finite number of pixel locations, x = (i, j), and we can write

$$g(i, j) = h(f(i, j)). \tag{3.2}$$

Figure 3.3 shows how an image can be represented either by its color (appearance), as a grid
of numbers, or as a two-dimensional function (surface plot).

¹ In convolutional neural networks (Section 5.4), such operations are sometimes called 1 × 1 convolutions.

Two commonly used point processes are multiplication and addition with a constant,

$$g(\mathbf{x}) = a f(\mathbf{x}) + b. \tag{3.3}$$

The parameters a > 0 and b are often called the gain and bias parameters; sometimes these
parameters are said to control contrast and brightness, respectively (Figures 3.2b-c).² The
bias and gain parameters can also be spatially varying,

$$g(\mathbf{x}) = a(\mathbf{x}) f(\mathbf{x}) + b(\mathbf{x}), \tag{3.4}$$


e.g., when simulating the graded density filter used by photographers to selectively darken 
the sky or when modeling vignetting in an optical system. 
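
As a minimal sketch of the gain and bias operators (3.3) and (3.4), assuming floating-point images scaled to [0, 1], both can be written with NumPy broadcasting. The function name `gain_bias`, the clipping of the result to [0, 1], and the linear ramp standing in for a graded density filter are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def gain_bias(f, a=1.0, b=0.0):
    """Apply g = a * f + b to an image in [0, 1]; a and b may be scalars or arrays."""
    g = a * np.asarray(f, dtype=np.float64) + b
    return np.clip(g, 0.0, 1.0)   # clipping is an assumption, not part of (3.3)

# A tiny 2 x 3 grayscale "image" with values in [0, 1].
f = np.array([[0.0, 0.25, 0.5],
              [0.5, 0.75, 1.0]])

brighter = gain_bias(f, b=16 / 255)    # global bias, as in Figure 3.2b
more_contrast = gain_bias(f, a=1.1)    # global gain, as in Figure 3.2c

# Spatially varying gain a(x): a hypothetical horizontal ramp from 0.5 to 1.0.
ramp = np.linspace(0.5, 1.0, f.shape[1])[np.newaxis, :]
graded = gain_bias(f, a=ramp)
```

Because `a` and `b` are broadcast against the image, the same function covers both the global and the spatially varying case.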
Multiplicative gain (both global and spatially varying) is a linear operation, as it obeys
the superposition principle,

$$h(f_0 + f_1) = h(f_0) + h(f_1). \tag{3.5}$$


(We will have more to say about linear shift invariant operators in Section 3.2.) Operators 
such as image squaring (which is often used to get a local estimate of the energy in a band- 
pass filtered signal, see Section 3.5) are not linear. 
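
The superposition principle can be checked numerically. The sketch below, which assumes random test images and a hypothetical spatially varying gain `a`, compares both sides of (3.5) for a multiplicative gain and for image squaring.

```python
import numpy as np

rng = np.random.default_rng(0)
f0 = rng.random((4, 4))
f1 = rng.random((4, 4))
a = rng.random((4, 4)) + 0.5   # hypothetical spatially varying gain a(x)

def gain(f):
    """Multiplicative gain h(f) = a(x) f(x)."""
    return a * f

def square(f):
    """Image squaring h(f) = f(x)^2."""
    return f ** 2

# Compare h(f0 + f1) with h(f0) + h(f1) for each operator.
gain_obeys_superposition = np.allclose(gain(f0 + f1), gain(f0) + gain(f1))
square_obeys_superposition = np.allclose(square(f0 + f1), square(f0) + square(f1))
```

For squaring, the two sides differ by the cross term 2 f0 f1, which is why the second check fails.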


Another commonly used dyadic (two-input) operator is the linear blend operator,

$$g(\mathbf{x}) = (1 - \alpha) f_0(\mathbf{x}) + \alpha f_1(\mathbf{x}). \tag{3.6}$$

By varying α from 0 → 1, this operator can be used to perform a temporal cross-dissolve
between two images or videos, as seen in slide shows and film production, or as a component 
of image morphing algorithms (Section 3.6.3). 
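
A cross-dissolve built from the linear blend operator might look as follows. This is only a sketch: the helper name `cross_dissolve` and the choice of evenly spaced blend weights are assumptions made for illustration.

```python
import numpy as np

def cross_dissolve(f0, f1, num_frames):
    """Return num_frames images blending linearly from f0 (alpha = 0) to f1 (alpha = 1)."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    return [(1.0 - alpha) * f0 + alpha * f1 for alpha in alphas]

f0 = np.zeros((2, 2))   # first image (all black)
f1 = np.ones((2, 2))    # second image (all white)
frames = cross_dissolve(f0, f1, num_frames=5)
```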

One highly used non-linear transform that is often applied to images before further pro- 
cessing is gamma correction, which is used to remove the non-linear mapping between input 
radiance and quantized pixel values (Section 2.3.2). To invert the gamma mapping applied 
by the sensor, we can use 


$$g(\mathbf{x}) = [f(\mathbf{x})]^{1/\gamma}, \tag{3.7}$$

where a gamma value of γ ≈ 2.2 is a reasonable fit for most digital cameras.


² An image’s luminance characteristics can also be summarized by its key (average luminance) and range (Kopf,
Uyttendaele et al. 2007).


3.1.2 Color transforms


Figure 3.4 Image matting and compositing (Chuang, Curless et al. 2001) © 2001 IEEE:
(a) source image; (b) extracted foreground object F; (c) alpha matte α shown in grayscale;
(d) new composite C.

While color images can be treated as arbitrary vector-valued functions or collections of independent
bands, it usually makes sense to think about them as highly correlated signals with
strong connections to the image formation process (Section 2.2), sensor design (Section 2.3),
and human perception (Section 2.3.2). Consider, for example, brightening a picture by adding
a constant value to all three channels, as shown in Figure 3.2b. Can you tell if this achieves the
desired effect of making the image look brighter? Can you see any undesirable side-effects
or artifacts?

In fact, adding the same value to each color channel not only increases the apparent in- 
tensity of each pixel, it can also affect the pixel’s hue and saturation. How can we define and 
manipulate such quantities in order to achieve the desired perceptual effects? 

As discussed in Section 2.3.2, chromaticity coordinates (2.105) or even simpler color ra- 
tios (2.117) can first be computed and then used after manipulating (e.g., brightening) the 
luminance Y to re-compute a valid RGB image with the same hue and saturation. Figures 
2.33f-h show some color ratio images multiplied by the middle gray value for better visual- 
ization. 
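
To make the difference between additive and multiplicative brightening concrete, the following sketch compares color ratios before and after each operation. It assumes the ratios take the form r = R/(R + G + B) (and similarly for g and b); the helper name `chromaticity` and the sample pixel are hypothetical.

```python
import numpy as np

def chromaticity(rgb):
    """Normalize an RGB triple by its sum, e.g., r = R / (R + G + B)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb / rgb.sum(axis=-1, keepdims=True)

pixel = np.array([0.6, 0.3, 0.1])
added = pixel + 0.2     # additive brightening (same offset on every channel)
scaled = pixel * 1.5    # multiplicative brightening (same gain on every channel)

same_after_scaling = np.allclose(chromaticity(scaled), chromaticity(pixel))
same_after_adding = np.allclose(chromaticity(added), chromaticity(pixel))
```

Scaling leaves the ratios untouched, whereas the additive offset pulls them towards equal values, i.e., it desaturates the color.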

Similarly, color balancing (e.g., to compensate for incandescent lighting) can be per- 
formed either by multiplying each channel with a different scale factor or by the more com- 
plex process of mapping to XYZ color space, changing the nominal white point, and mapping 
back to RGB, which can be written down using a linear 3 x 3 color twist transform matrix. 
Exercises 2.8 and 3.1 have you explore some of these issues. 
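
The simpler of the two color balancing options, per-channel scaling, can be sketched as below. The function name `white_balance`, the sample white color, and the decision to keep the green channel fixed (so that overall exposure changes little) are assumptions for illustration; the XYZ-based alternative is not shown.

```python
import numpy as np

def white_balance(image, white):
    """Scale each channel so that the sampled `white` color becomes neutral (R = G = B)."""
    white = np.asarray(white, dtype=np.float64)
    scale = white[1] / white    # assumption: leave the green channel unchanged
    return np.asarray(image, dtype=np.float64) * scale

white = np.array([0.9, 0.8, 0.6])        # hypothetical sample of a white sheet of paper
image = np.array([[[0.9, 0.8, 0.6],
                   [0.45, 0.2, 0.3]]])   # 1 x 2 RGB image
balanced = white_balance(image, white)
```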

Another fun project, best attempted after you have mastered the rest of the material in 
this chapter, is to take a picture with a rainbow in it and enhance the strength of the rainbow 
(Exercise 3.29). 


3.1.3 Compositing and matting 


In many photo editing and visual effects applications, it is often desirable to cut a foreground
object out of one scene and put it on top of a different background (Figure 3.4). The process
of extracting the object from the original image is often called matting (Smith and Blinn
1996), while the process of inserting it into another image (without visible artifacts) is called
compositing (Porter and Duff 1984; Blinn 1994a).

Figure 3.5 Compositing equation C = (1 − α)B + αF. The images are taken from a
close-up of the region of the hair in the upper right part of the lion in Figure 3.4.

The intermediate representation used for the foreground object between these two stages 
is called an alpha-matted color image (Figure 3.4b-c). In addition to the three color RGB 
channels, an alpha-matted image contains a fourth alpha channel α (or A) that describes the
relative amount of opacity or fractional coverage at each pixel (Figures 3.4c and 3.5b). The
opacity is the opposite of the transparency. Pixels within the object are fully opaque (α = 1),
while pixels fully outside the object are transparent (α = 0). Pixels on the boundary of the
object vary smoothly between these two extremes, which hides the perceptually visible jaggies
that occur if only binary opacities are used.

To composite a new (or foreground) image on top of an old (background) image, the over 
operator, first proposed by Porter and Duff (1984) and then studied extensively by Blinn 
(1994a; 1994b), is used: 

$$C = (1 - \alpha) B + \alpha F. \tag{3.8}$$

This operator attenuates the influence of the background image B by a factor (1 − α) and
then adds in the color (and opacity) values corresponding to the foreground layer F, as shown 
in Figure 3.5. 

In many situations, it is convenient to represent the foreground colors in pre-multiplied 
form, i.e., to store (and manipulate) the αF values directly. As Blinn (1994b) shows, the
pre-multiplied RGBA representation is preferred for several reasons, including the ability
to blur or resample (e.g., rotate) alpha-matted images without any additional complications
(just treating each RGBA band independently). However, when matting using local color
consistency (Ruzon and Tomasi 2000; Chuang, Curless et al. 2001), the pure un-multiplied
foreground colors F are used, since these remain constant (or vary slowly) in the vicinity of
the object edge.
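
The over operator and its pre-multiplied variant can be sketched in a few lines of NumPy. The function names and the tiny 1 × 3 test image are hypothetical; images are assumed to be floating-point arrays of shape (rows, cols, 3) with a matching (rows, cols) alpha matte.

```python
import numpy as np

def over(F, alpha, B):
    """Composite un-multiplied foreground F over background B: C = (1 - alpha) B + alpha F."""
    a = np.asarray(alpha, dtype=np.float64)[..., np.newaxis]   # broadcast over color channels
    return (1.0 - a) * B + a * F

def over_premultiplied(aF, alpha, B):
    """Same composite when the foreground is stored pre-multiplied (aF = alpha * F)."""
    a = np.asarray(alpha, dtype=np.float64)[..., np.newaxis]
    return (1.0 - a) * B + aF

# 1 x 3 RGB example: opaque, half-covered, and transparent foreground pixels.
F = np.array([[[1.0, 0.0, 0.0]] * 3])   # red foreground
B = np.array([[[0.0, 0.0, 1.0]] * 3])   # blue background
alpha = np.array([[1.0, 0.5, 0.0]])

C = over(F, alpha, B)
C_pre = over_premultiplied(alpha[..., np.newaxis] * F, alpha, B)
```

Both forms give the same composite; the pre-multiplied version simply skips one multiplication per pixel because the αF product is already stored.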

The over operation is not the only kind of compositing operation that can be used. Porter
and Duff (1984) describe a number of additional operations that can be useful in photo editing
and visual effects applications. In this book, we concern ourselves with only one additional
commonly occurring case (but see Exercise 3.3).

Figure 3.6 An example of light reflecting off the transparent glass of a picture frame (Black
and Anandan 1996) © 1996 Elsevier. You can clearly see the woman’s portrait inside the
picture frame superimposed with the reflection of a man’s face off the glass.

When light reflects off clean transparent glass, the light passing through the glass and 
the light reflecting off the glass are simply added together (Figure 3.6). This model is use- 
ful in the analysis of transparent motion (Black and Anandan 1996; Szeliski, Avidan, and 
Anandan 2000), which occurs when such scenes are observed from a moving camera (see 
Section 9.4.2). 

The actual process of matting, i.e., recovering the foreground, background, and alpha 
matte values from one or more images, has a rich history, which we study in Section 10.4. 
Smith and Blinn (1996) have a nice survey of traditional blue-screen matting techniques, 
while Toyama, Krumm et al. (1999) review difference matting. Since then, there has been 
a lot of activity in computational photography relating to natural image matting (Ruzon and 
Tomasi 2000; Chuang, Curless et al. 2001; Wang and Cohen 2009; Xu, Price et al. 2017), 
which attempts to extract the mattes from a single natural image (Figure 3.4a) or from ex- 
tended video sequences (Chuang, Agarwala et al. 2002). All of these techniques are described 
in more detail in Section 10.4. 


3.1.4 Histogram equalization 


While the brightness and gain controls described in Section 3.1.1 can improve the appearance
of an image, how can we automatically determine their best values? One approach might
be to look at the darkest and brightest pixel values in an image and map them to pure black
and pure white. Another approach might be to find the average value in the image, push it
towards middle gray, and expand the range so that it more closely fills the displayable values
(Kopf, Uyttendaele et al. 2007).

Figure 3.7 Histogram analysis and equalization: (a) original image; (b) color channel
and intensity (luminance) histograms; (c) cumulative distribution functions; (d) equalization
(transfer) functions; (e) full histogram equalization; (f) partial histogram equalization.

How can we visualize the set of lightness values in an image to test some of these heuristics?
The answer is to plot the histogram of the individual color channels and luminance
values, as shown in Figure 3.7b.³ From this distribution, we can compute relevant statistics
such as the minimum, maximum, and average intensity values. Notice that the image in
Figure 3.7a has both an excess of dark values and light values, but that the mid-range values are
largely under-populated. Would it not be better if we could simultaneously brighten some
dark values and darken some light values, while still using the full extent of the available
dynamic range? Can you think of a mapping that might do this?

One popular answer to this question is to perform histogram equalization, i.e., to find
an intensity mapping function f(I) such that the resulting histogram is flat. The trick to
finding such a mapping is the same one that people use to generate random samples from
a probability density function, which is to first compute the cumulative distribution function
shown in Figure 3.7c.

³ The histogram is simply the count of the number of pixels at each gray level value. For an eight-bit image, an
accumulation table with 256 entries is needed. For higher bit depths, a table with the appropriate number of entries
(probably fewer than the full number of gray levels) should be used.

Think of the original histogram h(I) as the distribution of grades in a class after some
exam. How can we map a particular grade to its corresponding percentile, so that students at
the 75% percentile range scored better than 3/4 of their classmates? The answer is to integrate
the distribution h(I) to obtain the cumulative distribution c(I),

$$c(I) = \frac{1}{N} \sum_{i=0}^{I} h(i) = c(I - 1) + \frac{1}{N} h(I), \tag{3.9}$$

where N is the number of pixels in the image or students in the class. For any given grade or
intensity, we can look up its corresponding percentile c(I) and determine the final value that
the pixel should take. When working with eight-bit pixel values, the I and c axes are rescaled
from [0, 255].

Figure 3.7e shows the result of applying f(I) = c(I) to the original image. As we
can see, the resulting histogram is flat; so is the resulting image (it is “flat” in the sense
of a lack of contrast and being muddy looking). One way to compensate for this is to only
partially compensate for the histogram unevenness, e.g., by using a mapping function f(I) =
αc(I) + (1 − α)I, which is a linear blend between the cumulative distribution function and
the identity transform (a straight line). As you can see in Figure 3.7f, the resulting image
maintains more of its original grayscale distribution while having a more appealing balance.
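
The full and partial equalization mappings can be sketched as a lookup table built from the cumulative distribution. The sketch assumes a single-channel eight-bit image stored as a NumPy `uint8` array; the function names and the rounding of the table back to eight bits are illustrative choices.

```python
import numpy as np

def equalization_lut(image, alpha=1.0, levels=256):
    """Lookup table f(I) = alpha * c(I) + (1 - alpha) * I, rescaled to [0, levels - 1]."""
    hist = np.bincount(image.ravel(), minlength=levels)   # histogram h(I)
    cdf = np.cumsum(hist) / image.size                    # cumulative distribution c(I) in [0, 1]
    identity = np.arange(levels) / (levels - 1)           # identity transform in [0, 1]
    lut = alpha * cdf + (1.0 - alpha) * identity
    return np.round(lut * (levels - 1)).astype(np.uint8)

def equalize(image, alpha=1.0):
    """Map every pixel through the lookup table (alpha = 1: full, alpha < 1: partial)."""
    return equalization_lut(image, alpha)[image]

image = np.array([[0, 0, 64, 64],
                  [128, 128, 255, 255]], dtype=np.uint8)
full = equalize(image)                  # full histogram equalization
partial = equalize(image, alpha=0.5)    # halfway between equalized and original
```

Setting `alpha=0` returns the original image, which is a convenient sanity check on the blend.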

Another potential problem with histogram equalization (or, in general, image brightening)
is that noise in dark regions can be amplified and become more visible. Exercise 3.7 suggests
some possible ways to mitigate this, as well as alternative techniques to maintain contrast and
“punch” in the original images (Larson, Rushmeier, and Piatko 1997; Stark 2000).


Locally adaptive histogram equalization 


While global histogram equalization can be useful, for some images it might be preferable
to apply different kinds of equalization in different regions. Consider for example the image
in Figure 3.8a, which has a wide range of luminance values. Instead of computing a single
curve, what if we were to subdivide the image into M × M pixel blocks and perform separate
histogram equalization in each sub-block? As you can see in Figure 3.8b, the resulting image
exhibits a lot of blocking artifacts, i.e., intensity discontinuities at block boundaries.

One way to eliminate blocking artifacts is to use a moving window, i.e., to recompute the
histogram for every M × M block centered at each pixel. This process can be quite slow
(M² operations per pixel), although with clever programming only the histogram entries
corresponding to the pixels entering and leaving the block (in a raster scan across the image)
need to be updated (M operations per pixel). Note that this operation is an example of the
non-linear neighborhood operations we study in more detail in Section 3.3.1.

Figure 3.8 Locally adaptive histogram equalization: (a) original image; (b) block histogram
equalization; (c) full locally adaptive equalization.

A more efficient approach is to compute non-overlapped block-based equalization functions
as before, but to then smoothly interpolate the transfer functions as we move between
blocks. This technique is known as adaptive histogram equalization (AHE) and its contrast-limited
(gain-limited) version is known as CLAHE (Pizer, Amburn et al. 1987).⁴ The weighting
function for a given pixel (i, j) can be computed as a function of its horizontal and vertical
position (s, t) within a block, as shown in Figure 3.9a. To blend the four lookup functions
$\{f_{00}, \ldots, f_{11}\}$, a bilinear blending function,

$$f_{s,t}(I) = (1 - s)(1 - t) f_{00}(I) + s(1 - t) f_{10}(I) + (1 - s) t f_{01}(I) + s t f_{11}(I) \tag{3.10}$$

can be used. (See Section 3.5.2 for higher-order generalizations of such spline functions.)
Note that instead of blending the four lookup tables for each output pixel (which would be
quite slow), we can instead blend the results of mapping a given pixel through the four
neighboring lookups.
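
Blending the mapped values rather than the tables themselves amounts to four table lookups and one weighted sum per pixel. The sketch below assumes each lookup function is stored as a 256-entry array; the four example tables are arbitrary placeholders, not real equalization functions.

```python
import numpy as np

def blend_lookups(I, s, t, f00, f10, f01, f11):
    """Bilinearly blend the four mapped values of intensity I, following Equation (3.10)."""
    return ((1 - s) * (1 - t) * f00[I] + s * (1 - t) * f10[I]
            + (1 - s) * t * f01[I] + s * t * f11[I])

levels = np.arange(256, dtype=np.float64)
f00 = levels                            # identity transfer function
f10 = 255.0 - levels                    # inverting transfer function (purely illustrative)
f01 = np.full(256, 128.0)               # constant transfer function
f11 = np.sqrt(levels / 255.0) * 255.0   # brightening transfer function

center = blend_lookups(100, 0.5, 0.5, f00, f10, f01, f11)
```

At the four extreme values of (s, t) the blend reduces to a single lookup, and the weights always sum to one.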

A variant on this algorithm is to place the lookup tables at the corners of each M × M
block (see Figure 3.9b and Exercise 3.8). In addition to blending four lookups to compute the
final value, we can also distribute each input pixel into four adjacent lookup tables during the
histogram accumulation phase (notice that the gray arrows in Figure 3.9b point both ways),
i.e.,

$$h_{k,l}(I(i, j)) \mathrel{+}= w(i, j, k, l), \tag{3.11}$$

where w(i, j, k, l) is the bilinear weighting function between pixel (i, j) and lookup table
(k, l). This is an example of soft histogramming, which is used in a variety of other applications,
including the construction of SIFT feature descriptors (Section 7.1.3) and vocabulary
trees (Section 7.1.4).


⁴ The CLAHE algorithm is part of OpenCV.

Figure 3.9 Local histogram interpolation using relative (s, t) coordinates: (a) block-based
histograms, with block centers shown as circles; (b) corner-based “spline” histograms. Pixels
are located on grid intersections. The black square pixel’s transfer function is interpolated
from the four adjacent lookup tables (gray arrows) using the computed (s, t) values. Block
boundaries are shown as dashed lines.


3.1.5 Application: Tonal adjustment 


One of the most widely used applications of point-wise image processing operators is the 
manipulation of contrast or tone in photographs, to make them look either more attractive or 
more interpretable. You can get a good sense of the range of operations possible by opening 
up any photo manipulation tool and trying out a variety of contrast, brightness, and color 
manipulation options, as shown in Figures 3.2 and 3.7. 

Exercises 3.1, 3.6, and 3.7 have you implement some of these operations, to become
familiar with basic image processing operators. More sophisticated techniques for tonal adjustment
(Bae, Paris, and Durand 2006; Reinhard, Heidrich et al. 2010) are described in the
section on high dynamic range tone mapping (Section 10.2.1).


3.2 Linear filtering 


Locally adaptive histogram equalization is an example of a neighborhood operator or local
operator, which uses a collection of pixel values in the vicinity of a given pixel to determine
its final output value (Figure 3.10). In addition to performing local tone adjustment, neighborhood
operators can be used to filter images to add soft blur, sharpen details, accentuate
edges, or remove noise (Figure 3.11b-d). In this section, we look at linear filtering operators,
which involve fixed weighted combinations of pixels in small neighborhoods. In Section 3.3,
we look at non-linear operators such as morphological filters and distance transforms.

Figure 3.10 Neighborhood filtering (convolution): The image on the left is convolved with
the filter in the middle to yield the image on the right. The light blue pixels indicate the source
neighborhood for the light green destination pixel.

The most widely used type of neighborhood operator is a linear filter, where an output
pixel’s value is a weighted sum of pixel values within a small neighborhood $\mathcal{N}$ (Figure 3.10),

$$g(i, j) = \sum_{k,l} f(i + k, j + l) h(k, l). \tag{3.12}$$
k,l 


The entries in the weight kernel or mask h(k, l) are often called the filter coefficients. The
above correlation operator can be more compactly notated as

$$g = f \otimes h. \tag{3.13}$$

A common variant on this formula is

$$g(i, j) = \sum_{k,l} f(i - k, j - l) h(k, l) = \sum_{k,l} f(k, l) h(i - k, j - l), \tag{3.14}$$

where the sign of the offsets in f has been reversed. This is called the convolution operator,

$$g = f * h, \tag{3.15}$$

and h is then called the impulse response function.⁵ The reason for this name is that the kernel
function, h, convolved with an impulse signal, δ(i, j) (an image that is 0 everywhere except
at the origin) reproduces itself, h ∗ δ = h, whereas correlation produces the reflected signal.
(Try this yourself to verify that it is so.)
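
One way to try this is with a direct implementation of both operators. The sketch below assumes odd-sized kernels whose origin is at the center and zero padding outside the image; the function names mirror SciPy's but are defined here from scratch.

```python
import numpy as np

def correlate2d(f, h):
    """Correlation g(i, j) = sum_{k,l} f(i + k, j + l) h(k, l) with zero padding."""
    kh, kw = h.shape
    rh, rw = kh // 2, kw // 2                # kernel radii (kernel origin at its center)
    fp = np.pad(f, ((rh, rh), (rw, rw)))     # zero padding
    g = np.zeros(f.shape, dtype=np.float64)
    for k in range(kh):
        for l in range(kw):
            # Kernel entry h[k, l] corresponds to the offset (k - rh, l - rw).
            g += h[k, l] * fp[k:k + f.shape[0], l:l + f.shape[1]]
    return g

def convolve2d(f, h):
    """Convolution is correlation with the kernel reflected about its origin."""
    return correlate2d(f, h[::-1, ::-1])

delta = np.zeros((5, 5))
delta[2, 2] = 1.0                            # impulse at the image center
h = np.arange(9, dtype=np.float64).reshape(3, 3)

conv = convolve2d(delta, h)                  # reproduces h around the impulse
corr = correlate2d(delta, h)                 # reproduces the reflected kernel
```

The asymmetric kernel makes the reflection visible; with a symmetric kernel the two results would coincide.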


⁵ The continuous version of convolution can be written as $g(\mathbf{x}) = \int f(\mathbf{x} - \mathbf{u}) h(\mathbf{u}) \, d\mathbf{u}$.

Figure 3.11 Some neighborhood operations: (a) original image; (b) blurred; (c) sharpened;
(d) smoothed with edge-preserving filter; (e) binary image; (f) dilated; (g) distance
transform; (h) connected components. For the dilation and connected components, black
(ink) pixels are assumed to be active, i.e., to have a value of 1 in Equations (3.44-3.48).

Figure 3.12 One-dimensional signal convolution as a sparse matrix-vector multiplication,
g = Hf.


In fact, Equation (3.14) can be interpreted as the superposition (addition) of shifted im- 
pulse response functions h(i — k, j — l) multiplied by the input pixel values f (k,l). Convolu- 
tion has additional nice properties, e.g., it is both commutative and associative. As well, the 
Fourier transform of two convolved images is the product of their individual Fourier trans- 
forms (Section 3.4). 

Both correlation and convolution are linear shift-invariant (LSI) operators, which obey 
both the superposition principle (3.5), 


$$h \circ (f_0 + f_1) = h \circ f_0 + h \circ f_1, \tag{3.16}$$

and the shift invariance principle,

$$g(i, j) = f(i + k, j + l) \;\Leftrightarrow\; (h \circ g)(i, j) = (h \circ f)(i + k, j + l), \tag{3.17}$$

which means that shifting a signal commutes with applying the operator (∘ stands for the LSI
operator). Another way to think of shift invariance is that the operator “behaves the same 
everywhere”. 


Occasionally, a shift-variant version of correlation or convolution may be used, e.g., 


$$g(i, j) = \sum_{k,l} f(i - k, j - l) h(k, l; i, j), \tag{3.18}$$

where h(k, l; i, j) is the convolution kernel at pixel (i, j). For example, such a spatially
varying kernel can be used to model blur in an image due to variable depth-dependent defocus.

Correlation and convolution can both be written as a matrix-vector multiplication, if we
first convert the two-dimensional images f(i, j) and g(i, j) into raster-ordered vectors f and
g,

$$\mathbf{g} = \mathbf{H} \mathbf{f}, \tag{3.19}$$

where the (sparse) H matrix contains the convolution kernels. Figure 3.12 shows how a
one-dimensional convolution can be represented in matrix-vector form.
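
For a one-dimensional signal, the matrix H can be built explicitly. The sketch below assumes an odd-length kernel centered on its middle tap and zero padding at the ends, and it stores H densely for clarity even though it is banded; the kernel and signal values are arbitrary.

```python
import numpy as np

def convolution_matrix(h, n):
    """Dense n x n matrix H such that H @ f is the zero-padded convolution of f with h."""
    r = len(h) // 2                 # kernel radius (odd-length kernel assumed)
    H = np.zeros((n, n))
    for i in range(n):
        for k, hk in enumerate(h):
            j = i - (k - r)         # convolution: g(i) = sum_k f(i - k) h(k)
            if 0 <= j < n:          # samples outside the signal are treated as zero
                H[i, j] = hk
    return H

h = np.array([1.0, 2.0, 3.0])
f = np.arange(6, dtype=np.float64)
H = convolution_matrix(h, len(f))
g = H @ f
```

Each row of H holds the reflected kernel shifted by one sample, and the result agrees with `np.convolve(f, h, mode="same")`.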
Figure 3.13 Border padding (top row) and the results of blurring the padded image (bottom
row). The normalized zero image is the result of dividing (normalizing) the blurred zero-padded
RGBA image by its corresponding soft alpha value. Bottom-row panels: blurred zero,
normalized zero, blurred clamp, blurred mirror.


Padding (border effects) 


The astute reader will notice that the correlation shown in Figure 3.10 produces a result that
is smaller than the original image, which may not be desirable in many applications.⁶ This is
because the neighborhoods of typical correlation and convolution operations extend beyond
the image boundaries near the edges, and so the filtered images suffer from boundary effects.

To deal with this, a number of different padding or extension modes have been developed
for neighborhood operations (Figure 3.13):


• zero: set all pixels outside the source image to 0 (a good choice for alpha-matted cutout
  images);

• constant (border color): set all pixels outside the source image to a specified border
  value;

• clamp (replicate or clamp to edge): repeat edge pixels indefinitely;

• (cyclic) wrap (repeat or tile): loop “around” the image in a “toroidal” configuration;

• mirror: reflect pixels across the image edge;

• extend: extend the signal by subtracting the mirrored version of the signal from the
  edge pixel value.

⁶ Note, however, that early convolutional networks such as LeNet (LeCun, Bottou et al. 1998)
adopted this structure.


In the computer graphics literature (Akenine-Möller and Haines 2002, p. 124), these mechanisms
are known as the wrapping mode (OpenGL) or texture addressing mode (Direct3D).
The formulas for these modes are left to the reader (Exercise 3.9).

Figure 3.13 shows the effects of padding an image with each of the above mechanisms and 
then blurring the resulting padded image. As you can see, zero padding darkens the edges,
clamp (replication) padding propagates border values inward, and mirror (reflection) padding
preserves colors near the borders. Extension padding (not shown) keeps the border pixels fixed
(during blur).
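
One possible set of index formulas for these padding modes is sketched below for a one-dimensional signal. This is only one reading of the definitions above, not the book's solution to the exercise: the function name `padded_sample` is hypothetical, "mirror" is assumed to repeat the edge pixel, and "extend" is assumed to point-reflect the signal about the edge pixel, i.e., f(-k) = 2 f(0) - f(k).

```python
def padded_sample(f, i, mode="zero", border=0):
    """Value of the 1D signal f at index i, which may lie outside [0, len(f) - 1]."""
    n = len(f)
    if 0 <= i < n:
        return f[i]
    if mode == "zero":
        return 0
    if mode == "constant":
        return border
    if mode == "clamp":
        return f[min(max(i, 0), n - 1)]
    if mode == "wrap":
        return f[i % n]
    if mode == "mirror":
        # Assumed convention: ... f[1] f[0] | f[0] f[1] ... (edge pixel repeated).
        p = i % (2 * n)
        return f[p] if p < n else f[2 * n - 1 - p]
    if mode == "extend":
        # Assumed convention: f(-k) = 2 f(0) - f(k), valid for overhangs shorter than n.
        if not -n < i < 2 * n - 1:
            raise ValueError("overhang too large for extend mode")
        if i < 0:
            return 2 * f[0] - f[-i]
        return 2 * f[n - 1] - f[2 * (n - 1) - i]
    raise ValueError("unknown padding mode: " + mode)

f = [10, 20, 30, 40]
left_of_signal = {m: padded_sample(f, -1, m)
                  for m in ("zero", "clamp", "wrap", "mirror", "extend")}
```

Under the assumed "extend" convention, a symmetric kernel such as [1, 2, 1]/4 leaves the border pixel unchanged, which is consistent with the remark that extension padding keeps the border pixels fixed during blur.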

An alternative to padding is to blur the zero-padded RGBA image and to then divide the 
resulting image by its alpha value to remove the darkening effect. The results can be quite 


good, as seen in the normalized zero image in Figure 3.13. 
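
The normalization idea can be illustrated in one dimension. In this sketch (hypothetical helper names, a constant test signal, and an alpha channel assumed to be 1 everywhere inside the image), the same kernel is applied to the signal and to its alpha, and the blurred signal is divided by the blurred alpha.

```python
def blur_zero_padded(f, h):
    """Correlate f with h, treating samples outside f as 0."""
    n, r = len(f), len(h) // 2
    return [sum(h[k + r] * f[i + k] for k in range(-r, r + 1) if 0 <= i + k < n)
            for i in range(n)]

def normalized_blur(f, h):
    """Blur the zero-padded signal and divide by the equally blurred alpha."""
    color = blur_zero_padded(f, h)
    alpha = blur_zero_padded([1.0] * len(f), h)   # alpha is 1 inside the image
    return [c / a for c, a in zip(color, alpha)]

signal = [100.0] * 6
tent = [0.25, 0.5, 0.25]
darkened = blur_zero_padded(signal, tent)   # border samples are pulled towards 0
restored = normalized_blur(signal, tent)    # division by alpha undoes the darkening
```

The division is safe as long as the blurred alpha is non-zero, which holds here because every window contains at least one valid sample with a positive weight.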


3.2.1 Separable filtering 


The process of performing a convolution requires K^2 (multiply-add) operations per pixel,
where K is the size (width or height) of the convolution kernel, e.g., the box filter in Fig- 
ure 3.14a. In many cases, this operation can be significantly sped up by first performing a 
one-dimensional horizontal convolution followed by a one-dimensional vertical convolution, 
which requires a total of 2K operations per pixel. A convolution kernel for which this is 
possible is said to be separable. 

It is easy to show that the two-dimensional kernel K corresponding to successive con- 
volution with a horizontal kernel h and a vertical kernel v is the outer product of the two 
kernels, 

    K = v h^T    (3.20)


(see Figure 3.14 for some examples). Because of the increased efficiency, the design of 
convolution kernels for computer vision applications is often influenced by their separability. 
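
The equivalence between a two-dimensional kernel K = v h^T and two one-dimensional passes can be verified directly. The following is a minimal sketch using correlation with zero padding; the helper names are hypothetical, and the example kernels happen to give the Sobel operator.

```python
def correlate2d(f, K):
    """Correlate image f (list of rows) with kernel K, treating outside pixels as 0."""
    rows, cols = len(f), len(f[0])
    rk, rl = len(K) // 2, len(K[0]) // 2
    g = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(-rk, rk + 1):
                for l in range(-rl, rl + 1):
                    if 0 <= i + k < rows and 0 <= j + l < cols:
                        g[i][j] += K[k + rk][l + rl] * f[i + k][j + l]
    return g

def outer(v, h):
    """Outer product v h^T of a vertical kernel v and a horizontal kernel h."""
    return [[vi * hj for hj in h] for vi in v]

def correlate_separable(f, v, h):
    """Horizontal 1D pass with h followed by a vertical 1D pass with v."""
    horizontal = correlate2d(f, [h])                    # 1 x K kernel
    return correlate2d(horizontal, [[x] for x in v])    # K x 1 kernel

image = [[3, 1, 4, 1, 5],
         [9, 2, 6, 5, 3],
         [5, 8, 9, 7, 9],
         [3, 2, 3, 8, 4]]
v, h = [1, 2, 1], [-1, 0, 1]
full_2d = correlate2d(image, outer(v, h))        # K^2 multiply-adds per pixel
two_passes = correlate_separable(image, v, h)    # 2K multiply-adds per pixel
```

For clarity the two passes reuse the general 2D routine; a production implementation would use dedicated row and column loops.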
How can we tell if a given kernel K is indeed separable? This can often be done by 
inspection or by looking at the analytic form of the kernel (Freeman and Adelson 1991). A 
more direct method is to treat the 2D kernel as a 2D matrix K and to take its singular value 

decomposition (SVD), 
    K = \sum_i \sigma_i u_i v_i^T    (3.21)

(see Appendix A.1.1 for the definition of the SVD). If only the first singular value \sigma_0 is
non-zero, the kernel is separable and \sqrt{\sigma_0} u_0 and \sqrt{\sigma_0} v_0^T provide the vertical and horizontal
kernels (Perona 1995). For example, the Laplacian of Gaussian kernel (3.26 and 7.23) can be
implemented as the sum of two separable filters (7.24) (Wiejak, Buxton, and Buxton 1985).

Figure 3.14 Separable linear filters: For each image (a)-(e), we show the 2D filter kernel
(top), the corresponding horizontal 1D kernel (middle), and the filtered image (bottom). The
filtered Sobel and corner images are signed, scaled up by 2x and 4x, respectively, and added
to a gray offset before display.

What if your kernel is not separable and yet you still want a faster way to implement
it? Perona (1995), who first made the link between kernel separability and SVD, suggests
using more terms in the (3.21) series, i.e., summing up a number of separable convolutions.
Whether this is worth doing or not depends on the relative sizes of K and the number of
significant singular values, as well as other considerations, such as cache coherency and memory
locality.
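
The separability test based on (3.21) can be sketched without a linear algebra library. Note the substitution: instead of a full SVD, the code below estimates only the dominant singular triplet by power iteration and then checks whether the rank-one term reproduces the kernel. All function names and the tolerance are assumptions made for illustration, and the kernel is assumed to be non-zero.

```python
import math

def norm(x):
    return math.sqrt(sum(a * a for a in x))

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dominant_singular_triplet(K, iterations=100):
    """Estimate (sigma_0, u_0, v_0) of a non-zero matrix K by power iteration."""
    Kt = [list(col) for col in zip(*K)]
    v = list(max(K, key=norm))      # largest row: guarantees K v is non-zero
    for _ in range(iterations):
        u = matvec(K, v)
        scale = norm(u)
        u = [a / scale for a in u]
        v = matvec(Kt, u)
        sigma = norm(v)
        v = [a / sigma for a in v]
    return sigma, u, v

def separate(K, tol=1e-9):
    """Return (vertical, horizontal) 1D kernels if K is separable, else None."""
    sigma, u, v = dominant_singular_triplet(K)
    residual = max(abs(K[i][j] - sigma * u[i] * v[j])
                   for i in range(len(K)) for j in range(len(K[0])))
    if residual > tol * sigma:
        return None                 # more than one significant singular value
    root = math.sqrt(sigma)
    return [root * a for a in u], [root * b for b in v]

sobel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
five_point_laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
sobel_factors = separate(sobel)                    # a pair of 1D kernels
laplacian_factors = separate(five_point_laplacian) # None: not separable
```

The recovered 1D kernels are only defined up to a common sign and a reciprocal scale factor, so they need not match the hand-designed kernels exactly, but their outer product reproduces K.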


3.2.2 Examples of linear filtering 


Now that we have described the process for performing linear filtering, let us examine a 
number of frequently used filters. 

The simplest filter to implement is the moving average or box filter, which simply averages 
the pixel values in a K x K window. This is equivalent to convolving the image with a kernel 
of all ones and then scaling (Figure 3.14a). For large kernels, a more efficient implementation 
is to slide a moving window across each scanline (in a separable filter) while adding the
newest pixel and subtracting the oldest pixel from the running sum. This is related to the
concept of summed area tables, which we describe shortly. 
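
The running-sum idea can be sketched for a single scanline as follows. The function names are hypothetical, the window width K is assumed to be odd, and samples outside the scanline are assumed to be zero.

```python
def box_filter_direct(f, K):
    """Mean over a centered window of odd width K (zero padded): K additions per pixel."""
    r = K // 2
    padded = [0] * r + list(f) + [0] * r
    return [sum(padded[i:i + K]) / K for i in range(len(f))]

def box_filter_running(f, K):
    """Same result using a running sum: two additions per pixel, independent of K."""
    n, r = len(f), K // 2
    padded = [0] * r + list(f) + [0] * r
    total = sum(padded[:K])                 # window centered on the first pixel
    out = [total / K]
    for i in range(1, n):
        total += padded[i + K - 1] - padded[i - 1]   # add newest, subtract oldest
        out.append(total / K)
    return out

scanline = [3, 1, 4, 1, 5, 9, 2, 6]
running = box_filter_running(scanline, 3)
direct = box_filter_direct(scanline, 3)
```

Applying the same routine along rows and then along columns gives the two-dimensional K x K box filter.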

A smoother image can be obtained by separably convolving the image with a piecewise 
linear “tent” function (also known as a Bartlett filter). Figure 3.14b shows a 3 x 3 version 
of this filter, which is called the bilinear kernel, since it is the outer product of two linear 
(first-order) splines (see Section 3.5.2). 

Convolving the linear tent function with itself yields the cubic approximating spline, 
which is called the “Gaussian” kernel (Figure 3.14c) in Burt and Adelson's (1983a) Lapla- 
cian pyramid representation (Section 3.5). Note that approximate Gaussian kernels can also 
be obtained by iterated convolution with box filters (Wells 1986). In applications where the 
filters really need to be rotationally symmetric, carefully tuned versions of sampled Gaussians 
should be used (Freeman and Adelson 1991) (Exercise 3.11). 

The kernels we just discussed are all examples of blurring (smoothing) or low-pass ker- 
nels, since they pass through the lower frequencies while attenuating higher frequencies. How 
good are they at doing this? In Section 3.4, we use frequency-space Fourier analysis to exam- 
ine the exact frequency response of these filters. We also introduce the sinc ((sin x)/x) filter, 
which performs ideal low-pass filtering. 

In practice, smoothing kernels are often used to reduce high-frequency noise. We have 
much more to say about using variants of smoothing to remove noise later (see Sections 3.3.1
and 3.4, as well as Chapters 4 and 5).

Surprisingly, smoothing kernels can also be used to sharpen images using a process called 
unsharp masking. Since blurring the image reduces high frequencies, adding some of the 


difference between the original and the blurred image makes it sharper, 


    g_{sharp} = f + \gamma (f - h_{blur} * f).    (3.22)


In fact, before the advent of digital photography, this was the standard way to sharpen images 
in the darkroom: create a blurred (“positive”) negative from the original negative by mis- 
focusing, then overlay the two negatives before printing the final image, which corresponds 


to

    g_{unsharp} = f (1 - \gamma h_{blur} * f).    (3.23)


This is no longer a linear filter but it still works well. 
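
The linear unsharp masking formula (3.22) can be tried on a one-dimensional step edge. This is a small sketch with assumed choices: a 3-tap tent kernel as the blur, clamp padding at the ends, and gamma = 1; the function names are hypothetical.

```python
def blur(f, h):
    """Correlate f with h using clamp (replicate) padding."""
    n, r = len(f), len(h) // 2
    return [sum(h[k + r] * f[min(max(i + k, 0), n - 1)] for k in range(-r, r + 1))
            for i in range(n)]

def sharpen(f, h_blur, gamma):
    """Unsharp masking: g_sharp = f + gamma * (f - h_blur * f), as in (3.22)."""
    blurred = blur(f, h_blur)
    return [x + gamma * (x - b) for x, b in zip(f, blurred)]

step = [10.0] * 4 + [20.0] * 4
tent = [0.25, 0.5, 0.25]
sharp = sharpen(step, tent, gamma=1.0)
```

Flat regions are unchanged, while the samples on either side of the step undershoot and overshoot, which is what makes the edge look crisper.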

Linear filtering can also be used as a pre-processing stage to edge extraction (Section 7.2) 
and interest point detection (Section 7.1) algorithms. Figure 3.14d shows a simple 3 x 3 edge 
extractor called the Sobel operator, which is a separable combination of a horizontal central 
difference (so called because the horizontal derivative is centered on the pixel) and a vertical
tent filter (to smooth the results). As you can see in the image below the kernel, this filter
effectively emphasizes vertical edges. 

The simple corner detector (Figure 3.14e) looks for simultaneous horizontal and vertical 
second derivatives. As you can see, however, it responds not only to the corners of the square, 
but also along diagonal edges. Better corner detectors, or at least interest point detectors that 


are more rotationally invariant, are described in Section 7.1. 


3.2.3 Band-pass and steerable filters 


The Sobel and corner operators are simple examples of band-pass and oriented filters. More 

sophisticated kernels can be created by first smoothing the image with a (unit area) Gaussian 

filter, 

    G(x, y; \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}},    (3.24)

and then taking the first or second derivatives (Marr 1982; Witkin 1983; Freeman and Adelson 


1991). Such filters are known collectively as band-pass filters, since they filter out both low 


and high frequencies. 


The (undirected) second derivative of a two-dimensional image, 


    \nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2},    (3.25)


is known as the Laplacian operator. Blurring an image with a Gaussian and then taking its 


Laplacian is equivalent to convolving directly with the Laplacian of Gaussian (LoG) filter, 


    \nabla^2 G(x, y; \sigma) = \left( \frac{x^2 + y^2}{\sigma^4} - \frac{2}{\sigma^2} \right) G(x, y; \sigma),    (3.26)


which has certain nice scale-space properties (Witkin 1983; Witkin, Terzopoulos, and Kass 
1986). The five-point Laplacian is just a compact approximation to this more sophisticated 
filter. 

Likewise, the Sobel operator is a simple approximation to a directional or oriented filter,
which can be obtained by smoothing with a Gaussian (or some other filter) and then taking a
directional derivative \nabla_{\hat{u}} = \partial / \partial \hat{u}, which is obtained by taking the dot product between the
gradient field \nabla and a unit direction \hat{u} = (\cos\theta, \sin\theta),

    \hat{u} \cdot \nabla (G * f) = \nabla_{\hat{u}} (G * f) = (\nabla_{\hat{u}} G) * f.    (3.27)


The smoothed directional derivative filter, 


    G_{\hat{u}} = u G_x + v G_y = u \frac{\partial G}{\partial x} + v \frac{\partial G}{\partial y},    (3.28)


128 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


Figure 3.15 Second-order steerable filter (Freeman 1992) © 1992 IEEE: (a) original image
of Einstein; (b) orientation map computed from the second-order oriented energy; (c)
original image with oriented structures enhanced.


where \hat{u} = (u, v), is an example of a steerable filter, since the value of an image convolved
with G_{\hat{u}} can be computed by first convolving with the pair of filters (G_x, G_y) and then steering
the filter (potentially locally) by multiplying this gradient field with a unit vector \hat{u} (Freeman
and Adelson 1991). The advantage of this approach is that a whole family of filters can
be evaluated with very little cost.
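
The steering property of (3.28) can be checked numerically. The sketch below makes several assumptions: the Gaussian derivatives are simply sampled on a 7 x 7 grid with sigma = 1, convolution uses zero padding, and the test image is a vertical step edge. All function names are hypothetical.

```python
import math

def gaussian_derivative_kernels(sigma=1.0, radius=3):
    """Sample G_x and G_y (x and y derivatives of a 2D Gaussian) on a square grid."""
    Gx, Gy = [], []
    for y in range(-radius, radius + 1):
        row_x, row_y = [], []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            row_x.append(-x / sigma ** 2 * g)
            row_y.append(-y / sigma ** 2 * g)
        Gx.append(row_x)
        Gy.append(row_y)
    return Gx, Gy

def convolve2d(f, h):
    """Convolve image f with a square kernel h, treating outside pixels as 0."""
    rows, cols, r = len(f), len(f[0]), len(h) // 2
    g = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(-r, r + 1):
                for l in range(-r, r + 1):
                    if 0 <= i - k < rows and 0 <= j - l < cols:
                        g[i][j] += h[k + r][l + r] * f[i - k][j - l]
    return g

def steer(Rx, Ry, theta):
    """Combine the two basis responses into the response for direction theta."""
    u, v = math.cos(theta), math.sin(theta)
    return [[u * a + v * b for a, b in zip(ra, rb)] for ra, rb in zip(Rx, Ry)]

image = [[1.0 if j >= 4 else 0.0 for j in range(8)] for _ in range(8)]  # vertical edge
Gx, Gy = gaussian_derivative_kernels()
Rx, Ry = convolve2d(image, Gx), convolve2d(image, Gy)   # computed once
response_at_40_degrees = steer(Rx, Ry, math.radians(40))
```

Once Rx and Ry are available, the response for any orientation costs only two multiplications and one addition per pixel, instead of a new convolution.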

How about steering a directional second derivative filter \nabla_{\hat{u}} \cdot \nabla_{\hat{u}} G, which is the result
of taking a (smoothed) directional derivative and then taking the directional derivative again?
For example, G_{xx} is the second directional derivative in the x direction.

At first glance, it would appear that the steering trick will not work, since for every
direction \hat{u}, we need to compute a different first directional derivative. Somewhat surprisingly,
Freeman and Adelson (1991) showed that, for directional Gaussian derivatives, it is possible
to steer any order of derivative with a relatively small number of basis functions. For example,
only three basis functions are required for the second-order directional derivative,

    G_{\hat{u}\hat{u}} = u^2 G_{xx} + 2uv G_{xy} + v^2 G_{yy}.    (3.29)
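
As a short worked step (a sketch that assumes the direction \hat{u} = (u, v) is constant, so that it commutes with differentiation), the three-term form of (3.29) follows from applying the directional derivative operator twice:

```latex
\begin{align*}
\nabla_{\hat{u}} &= u \frac{\partial}{\partial x} + v \frac{\partial}{\partial y}, \\
G_{\hat{u}\hat{u}} = \nabla_{\hat{u}} \left( \nabla_{\hat{u}} G \right)
  &= \left( u \frac{\partial}{\partial x} + v \frac{\partial}{\partial y} \right)
     \left( u G_x + v G_y \right) \\
  &= u^2 G_{xx} + uv\, G_{xy} + uv\, G_{yx} + v^2 G_{yy} \\
  &= u^2 G_{xx} + 2uv\, G_{xy} + v^2 G_{yy},
\end{align*}
```

where the last line uses the equality of mixed partial derivatives, G_{xy} = G_{yx}, which holds for the smooth Gaussian.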


Furthermore, each of the basis filters, while not itself necessarily separable, can be computed 
using a linear combination of a small number of separable filters (Freeman and Adelson 
1991). 

This remarkable result makes it possible to construct directional derivative filters of in- 
creasingly greater directional selectivity, i.e., filters that only respond to edges that have 
strong local consistency in orientation (Figure 3.15). Furthermore, higher order steerable 
filters can respond to potentially more than a single edge orientation at a given location, and 
they can respond to both bar edges (thin lines) and the classic step edges (Figure 3.16). In 
order to do this, however, full Hilbert transform pairs need to be used for second-order and
higher filters, as described in (Freeman and Adelson 1991).

Figure 3.16 Fourth-order steerable filter (Freeman and Adelson 1991) © 1991 IEEE: (a)
test image containing bars (lines) and step edges at different orientations; (b) average
oriented energy; (c) dominant orientation; (d) oriented energy as a function of angle (polar
plot).

Steerable filters are often used to construct both feature descriptors (Section 7.1.3) and 
edge detectors (Section 7.2). While the filters developed by Freeman and Adelson (1991) 
are best suited for detecting linear (edge-like) structures, more recent work by Koethe (2003) 
shows how a combined 2 x 2 boundary tensor can be used to encode both edge and junction 
(“corner”) features. Exercise 3.13 has you implement such steerable filters and apply them to 
finding both edge and corner features. 


Summed area table (integral image) 


If an image is going to be repeatedly convolved with different box filters (and especially filters 
of different sizes at different locations), you can precompute the summed area table (Crow 
1984), which is just the running sum of all the pixel values from the origin, 
    s(i, j) = \sum_{k=0}^{i} \sum_{l=0}^{j} f(k, l).    (3.30)


This can be efficiently computed using a recursive (raster-scan) algorithm, 


    s(i, j) = s(i - 1, j) + s(i, j - 1) - s(i - 1, j - 1) + f(i, j).    (3.31)


The image s(i, j) is also often called an integral image (see Figure 3.17) and can actually be
computed using only two additions per pixel if separate row sums are used (Viola and Jones
2004). To find the summed area (integral) inside a rectangle [i_0, i_1] × [j_0, j_1], we simply
combine four samples from the summed area table,

    S(i_0 \ldots i_1, j_0 \ldots j_1) = s(i_1, j_1) - s(i_1, j_0 - 1) - s(i_0 - 1, j_1) + s(i_0 - 1, j_0 - 1).    (3.32)

Figure 3.17 Summed area tables: (a) original image; (b) summed area table; (c) computation
of area sum. Each value in the summed area table s(i, j) (red) is computed recursively
from its three adjacent (blue) neighbors (3.31). Area sums S (green) are computed by combining
the four values at the rectangle corners (purple) (3.32). Positive values are shown in
bold and negative values in italics.


A potential disadvantage of summed area tables is that they require log M + log N extra bits 
in the accumulation image compared to the original image, where M and N are the image 
width and height. Extensions of summed area tables can also be used to approximate other 
convolution kernels (Wolberg (1990, Section 6.5.2) contains a review). 
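
The recursion (3.31) and the four-sample lookup (3.32) can be sketched as follows. The function names are hypothetical, table entries with a negative index are treated as zero, and the small test image is an arbitrary example rather than data taken from the figure.

```python
def summed_area_table(f):
    """Build s(i, j) with the raster-scan recursion (3.31)."""
    rows, cols = len(f), len(f[0])
    s = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            s[i][j] = (f[i][j]
                       + (s[i - 1][j] if i > 0 else 0)
                       + (s[i][j - 1] if j > 0 else 0)
                       - (s[i - 1][j - 1] if i > 0 and j > 0 else 0))
    return s

def area_sum(s, i0, i1, j0, j1):
    """Sum of f over rows i0..i1 and columns j0..j1 (inclusive), as in (3.32)."""
    def at(i, j):
        return s[i][j] if i >= 0 and j >= 0 else 0
    return at(i1, j1) - at(i1, j0 - 1) - at(i0 - 1, j1) + at(i0 - 1, j0 - 1)

image = [[3, 2, 7, 2, 3],
         [1, 5, 1, 3, 4],
         [5, 1, 3, 5, 1],
         [4, 3, 2, 1, 6],
         [2, 4, 1, 4, 8]]
table = summed_area_table(image)
rect_total = area_sum(table, 2, 3, 2, 4)   # rows 2..3, columns 2..4
```

Every rectangle query costs four table reads regardless of the rectangle size, which is what makes box filters of varying sizes cheap once the table is built.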

In computer vision, summed area tables have been used in face detection (Viola and 
Jones 2004) to compute simple multi-scale low-level features. Such features, which consist 
of adjacent rectangles of positive and negative values, are also known as boxlets (Simard, 
Bottou et al. 1998). In principle, summed area tables could also be used to compute the 
sums in the sum of squared differences (SSD) stereo and motion algorithms (Section 12.4). 
In practice, separable moving average filters are usually preferred (Kanade, Yoshida et al. 


1996), unless many different window shapes and sizes are being considered (Veksler 2003). 


Recursive filtering 


The incremental formula (3.31) for the summed area is an example of a recursive filter, i.e.,
one whose values depend on previous filter outputs. In the signal processing literature, such
filters are known as infinite impulse response (IIR), since the output of the filter to an impulse
(single non-zero value) goes on forever. For example, for a summed area table, an impulse
generates an infinite rectangle of 1s below and to the right of the impulse. The filters we have
previously studied in this chapter, which involve convolving the image with a finite extent kernel, are
known as finite impulse response (FIR).
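
The difference between the two filter classes shows up directly in their impulse responses. The sketch below uses a one-dimensional running sum as the recursive filter and a causal 3-sample box sum as the non-recursive one; both functions are hypothetical illustrations.

```python
def running_sum(f):
    """Recursive (IIR) filter: each output reuses the previous output."""
    out, previous = [], 0
    for x in f:
        previous = previous + x
        out.append(previous)
    return out

def box_sum(f, K=3):
    """Non-recursive (FIR) filter: each output depends only on the last K inputs."""
    return [sum(f[max(0, i - K + 1):i + 1]) for i in range(len(f))]

impulse = [0, 0, 1, 0, 0, 0, 0, 0]
iir_response = running_sum(impulse)   # stays at 1 for as long as the signal lasts
fir_response = box_sum(impulse)       # returns to 0 after K samples
```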


Two-dimensional IIR filters and recursive formulas are sometimes used to compute quantities
that involve large area interactions, such as two-dimensional distance functions (Section 3.3.3)
and connected components (Section 3.3.3).

More commonly, however, IIR filters are used inside one-dimensional separable filtering
stages to compute large-extent smoothing kernels, such as efficient approximations to Gaussians
and edge filters (Deriche 1990; Nielsen, Florack, and Deriche 1997). Pyramid-based
algorithms (Section 3.5) can also be used to perform such large-area smoothing computations.

Figure 3.18 Median and bilateral filtering: (a) original image with Gaussian noise; (b)
Gaussian filtered; (c) median filtered; (d) bilaterally filtered; (e) original image with shot
noise; (f) Gaussian filtered; (g) median filtered; (h) bilaterally filtered. Note that the bilateral
filter fails to remove the shot noise because the noisy pixels are too different from their
neighbors.


3.3 More neighborhood operators 


As we have just seen, linear filters can perform a wide variety of image transformations. 
However non-linear filters, such as edge-preserving median or bilateral filters, can sometimes 
perform even better. Other examples of neighborhood operators include morphological oper- 
ators that operate on binary images, as well as semi-global operators that compute distance 


transforms and find connected components in binary images (Figure 3.11f–h).

Figure 3.19 Median and bilateral filtering: (a) median pixel (green); (b) selected α-trimmed
mean pixels; (c) domain filter (numbers along edge are pixel distances); (d) range
filter.


3.3.1 Non-linear filtering 


The filters we have looked at so far have all been linear, i.e., their response to a sum of two 
signals is the same as the sum of the individual responses. This is equivalent to saying that 
each output pixel is a weighted summation of some number of input pixels (3.19). Linear 
filters are easier to compose and are amenable to frequency response analysis (Section 3.4). 

In many cases, however, better performance can be obtained by using a non-linear com- 
bination of neighboring pixels. Consider for example the image in Figure 3.18e, where the 
noise, rather than being Gaussian, is shot noise, i.e., it occasionally has very large values. In
this case, regular blurring with a Gaussian filter fails to remove the noisy pixels and instead 
turns them into softer (but still visible) spots (Figure 3.18f). 


Median filtering 


A better filter to use in this case is the median filter, which selects the median value from each 
pixel’s neighborhood (Figure 3.19a). Median values can be computed in expected linear time 
using a randomized select algorithm (Cormen 2001) and incremental variants have also been 
developed (Tomasi and Manduchi 1998; Bovik 2000, Section 3.2), as well as a constant time 
algorithm that is independent of window size (Perreault and Hébert 2007). Since the shot 
noise value usually lies well outside the true values in the neighborhood, the median filter is 
able to filter away such bad pixels (Figure 3.18g). 

One downside of the median filter, in addition to its moderate computational cost, is that 
because it selects only one input pixel value to replace each output pixel, it is not as efficient at 
averaging away regular Gaussian noise (Huber 1981; Hampel, Ronchetti et al. 1986; Stewart 
1999). A better choice may be the α-trimmed mean (Lee and Redner 1990; Crane 1997,
p. 109), which averages together all of the pixels except for the α fraction that are the smallest
and the largest (Figure 3.19b).
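
Both filters can be sketched on a one-dimensional signal with a single shot-noise sample. The function names are hypothetical, clamp padding is assumed at the ends, and the α-trimmed mean is read here as discarding the α fraction of samples at each end of the sorted window (one possible interpretation, which requires α < 0.5).

```python
from statistics import median

def window(f, i, r):
    """Samples f[i - r .. i + r], clamping indices at the signal ends."""
    n = len(f)
    return [f[min(max(i + k, 0), n - 1)] for k in range(-r, r + 1)]

def median_filter(f, r=1):
    """Replace each sample by the median of its (2r + 1)-sample neighborhood."""
    return [median(window(f, i, r)) for i in range(len(f))]

def trimmed_mean_filter(f, r=2, alpha=0.2):
    """Average each sorted window after dropping the alpha fraction at each end."""
    out = []
    for i in range(len(f)):
        w = sorted(window(f, i, r))
        cut = int(alpha * len(w))          # samples dropped at EACH end (assumption)
        kept = w[cut:len(w) - cut]
        out.append(sum(kept) / len(kept))
    return out

noisy = [10, 10, 200, 10, 10, 10, 12, 11]   # sample 2 is a shot-noise outlier
median_result = median_filter(noisy)
trimmed_result = trimmed_mean_filter(noisy)
```

Unlike the median, the trimmed mean still averages several samples per window, so it retains some ability to average away ordinary Gaussian noise.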

Another possibility is to compute a weighted median, in which each pixel is used a num- 
ber of times depending on its distance from the center. This turns out to be equivalent to 


minimizing the weighted objective function 


    \sum_{k,l} w(k, l) |f(i + k, j + l) - g(i, j)|^p,    (3.33)

where g(i, j) is the desired output value and p = 1 for the weighted median. The value p = 2
is the usual weighted mean, which is equivalent to correlation (3.12) after normalizing by 
the sum of the weights (Haralick and Shapiro 1992, Section 7.2.6; Bovik 2000, Section 3.2). 
The weighted mean also has deep connections to other methods in robust statistics (see Ap- 
pendix B.3), such as influence functions (Huber 1981; Hampel, Ronchetti et al. 1986). 
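
The connection between (3.33) and the weighted median or mean can be checked on a single neighborhood. In the sketch below (hypothetical function names, arbitrary sample values and weights), the weighted median is found by accumulating sorted weights until half of the total weight is reached.

```python
def weighted_median(values, weights):
    """Return a value g minimizing sum_k w_k |v_k - g|, i.e., (3.33) with p = 1."""
    half = sum(weights) / 2
    running = 0
    for v, w in sorted(zip(values, weights)):
        running += w
        if running >= half:
            return v

def weighted_mean(values, weights):
    """Minimizer of (3.33) with p = 2."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def objective(values, weights, g, p):
    """Evaluate the weighted objective (3.33) for a candidate output value g."""
    return sum(w * abs(v - g) ** p for v, w in zip(values, weights))

values = [10, 200, 12, 11, 9]     # one outlier (200)
weights = [1, 2, 4, 2, 1]         # e.g., larger weights closer to the center
robust = weighted_median(values, weights)
average = weighted_mean(values, weights)
```

The outlier drags the weighted mean far from the typical values, while the weighted median stays at one of the inlier samples.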

Non-linear smoothing has another, perhaps even more important property, especially as 
shot noise is rare in today’s cameras. Such filtering is more edge preserving, i.e., it has less 
tendency to soften edges while filtering away high-frequency noise. 

Consider the noisy image in Figure 3.18a. In order to remove most of the noise, the 
Gaussian filter is forced to smooth away high-frequency detail, which is most noticeable near 
strong edges. Median filtering does better but, as mentioned before, does not do as well at 
smoothing away from discontinuities. See Tomasi and Manduchi (1998) for some additional 
references to edge-preserving smoothing techniques. 

While we could try to use the α-trimmed mean or weighted median, these techniques still
have a tendency to round sharp corners, since the majority of pixels in the smoothing area 


come from the background distribution. 


3.3.2 Bilateral filtering 


What if we were to combine the idea of a weighted filter kernel with a better version of outlier 
rejection? What if instead of rejecting a fixed percentage α, we simply reject (in a soft way)
pixels whose values differ too much from the central pixel value? This is the essential idea in 
bilateral filtering, which was first popularized in the computer vision community by Tomasi 
and Manduchi (1998), although it had been proposed earlier by Aurich and Weule (1995) 
and Smith and Brady (1997). Paris, Kornprobst et al. (2008) provide a nice review of work 
in this area as well as myriad applications in computer vision, graphics, and computational 
photography. 


In the bilateral filter, the output pixel value depends on a weighted combination of neighboring
pixel values

    g(i, j) = \frac{\sum_{k,l} f(k, l) w(i, j, k, l)}{\sum_{k,l} w(i, j, k, l)}.    (3.34)

The weighting coefficient w(i, j, k, l) depends on the product of a domain kernel (Figure 3.19c),

    d(i, j, k, l) = \exp\left( -\frac{(i - k)^2 + (j - l)^2}{2\sigma_d^2} \right),    (3.35)

and a data-dependent range kernel (Figure 3.19d),

    r(i, j, k, l) = \exp\left( -\frac{\| f(i, j) - f(k, l) \|^2}{2\sigma_r^2} \right).    (3.36)

When multiplied together, these yield the data-dependent bilateral weight function

    w(i, j, k, l) = \exp\left( -\frac{(i - k)^2 + (j - l)^2}{2\sigma_d^2} - \frac{\| f(i, j) - f(k, l) \|^2}{2\sigma_r^2} \right).    (3.37)


Figure 3.20 shows an example of the bilateral filtering of a noisy step edge. Note how the do- 
main kernel is the usual Gaussian, the range kernel measures appearance (intensity) similarity 
to the center pixel, and the bilateral filter kernel is a product of these two. 
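
A direct (unaccelerated) implementation of (3.34) to (3.36) for a one-dimensional grayscale signal might look as follows. This is a minimal sketch: the function name, the default parameter values, and the truncation of the window at the signal ends are assumptions made for illustration.

```python
import math

def bilateral_filter(f, sigma_d=2.0, sigma_r=10.0, radius=4):
    """Brute-force 1D bilateral filter following (3.34)-(3.36)."""
    n = len(f)
    out = []
    for i in range(n):
        numerator = denominator = 0.0
        for k in range(max(0, i - radius), min(n, i + radius + 1)):
            d = math.exp(-((i - k) ** 2) / (2 * sigma_d ** 2))        # domain kernel
            r = math.exp(-((f[i] - f[k]) ** 2) / (2 * sigma_r ** 2))  # range kernel
            numerator += f[k] * d * r
            denominator += d * r
        out.append(numerator / denominator)
    return out

noisy_step = [10, 12, 9, 11, 10, 100, 98, 101, 99, 100]
smoothed = bilateral_filter(noisy_step)
```

Samples on the far side of the step receive a negligible range weight, so each side is smoothed separately and the edge stays sharp. For color images, the squared difference in the range kernel would be replaced by the squared vector distance between the two colors.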

Notice that for color images, the range filter (3.36) uses the vector distance between the 
center and the neighboring pixel. This is important in color images, since an edge in any one 
of the color bands signals a change in material and hence the need to downweight a pixel's
influence.⁷

Since bilateral filtering is quite slow compared to regular separable filtering, a number 
of acceleration techniques have been developed, as discussed in Durand and Dorsey (2002), 
Paris and Durand (2009), Chen, Paris, and Durand (2007), and Paris, Kornprobst et al. (2008). 
In particular, the bilateral grid (Chen, Paris, and Durand 2007), which subsamples the higher- 
dimensional color/position space on a uniform grid, continues to be widely used, including 
the application of the bilateral solver (Section 4.2.3 and Barron and Poole (2016)). An even 
faster implementation of bilateral filtering can be obtained using the permutohedral lattice 
approach developed by Adams, Baek, and Davis (2010). 


Iterated adaptive smoothing and anisotropic diffusion 


Bilateral (and other) filters can also be applied in an iterative fashion, especially if an appear- 
ance more like a “cartoon” is desired (Tomasi and Manduchi 1998). When iterated filtering 


is applied, a much smaller neighborhood can often be used.


⁷ Tomasi and Manduchi (1998) show that using the vector distance (as opposed to filtering each color band
separately) reduces color fringing effects. They also recommend taking the color difference in the more perceptually 
uniform CIELAB color space (see Section 2.3.2). 


Figure 3.20 Bilateral filtering (Durand and Dorsey 2002) © 2002 ACM: (a) noisy step
edge input; (b) domain filter (Gaussian); (c) range filter (similarity to center pixel value); (d)
bilateral filter; (e) filtered step edge output; (f) 3D distance between pixels.


Consider, for example, using only the four nearest neighbors, i.e., restricting |k - i| + |l - j| \le 1
in (3.34). Observe that

    d(i, j, k, l) = \exp\left( -\frac{(i - k)^2 + (j - l)^2}{2\sigma_d^2} \right)    (3.38)

                  = \begin{cases} 1, & |k - i| + |l - j| = 0, \\ \eta = e^{-1/(2\sigma_d^2)}, & |k - i| + |l - j| = 1. \end{cases}    (3.39)

We can thus re-write (3.34) as

    f^{(t+1)}(i, j) = \frac{f^{(t)}(i, j) + \eta \sum_{k,l} f^{(t)}(k, l) r(i, j, k, l)}{1 + \eta \sum_{k,l} r(i, j, k, l)}    (3.40)

                    = f^{(t)}(i, j) + \frac{\eta}{1 + \eta R} \sum_{k,l} r(i, j, k, l) [f^{(t)}(k, l) - f^{(t)}(i, j)],

where R = \sum_{(k,l)} r(i, j, k, l), (k, l) are the N_4 (nearest four) neighbors of (i, j), and we
have made the iterative nature of the filtering explicit.
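
One iteration of (3.40) can be sketched as follows for a grayscale image stored as a list of rows. The assumptions are a Gaussian range kernel as in (3.36), neighbors outside the image simply being skipped, and the neighbor weight set to exp(-1/(2 sigma_d^2)) as in (3.39); the function name and parameter values are hypothetical.

```python
import math

def adaptive_smoothing_step(f, sigma_d=1.0, sigma_r=10.0):
    """One N4 iteration of (3.40), second form: f + eta / (1 + eta R) * sum r (f_k - f)."""
    eta = math.exp(-1.0 / (2 * sigma_d ** 2))
    rows, cols = len(f), len(f[0])
    out = [row[:] for row in f]
    for i in range(rows):
        for j in range(cols):
            flow = R = 0.0
            for k, l in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= k < rows and 0 <= l < cols:
                    r = math.exp(-((f[i][j] - f[k][l]) ** 2) / (2 * sigma_r ** 2))
                    flow += r * (f[k][l] - f[i][j])
                    R += r
            out[i][j] = f[i][j] + eta / (1 + eta * R) * flow
    return out

image = [[10.0, 12.0, 9.0, 100.0, 98.0, 101.0],
         [11.0, 10.0, 12.0, 99.0, 100.0, 102.0],
         [9.0, 11.0, 10.0, 101.0, 99.0, 100.0]]
smoothed = image
for _ in range(10):
    smoothed = adaptive_smoothing_step(smoothed)
```

Each update is a convex combination of a pixel and its neighbors, with almost no weight given to neighbors across the step, so repeated application flattens each region while leaving the edge in place.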
As Barash (2002) notes, (3.40) is the same as the discrete anisotropic diffusion equation
first proposed by Perona and Malik (1990b).⁸ Since its original introduction, anisotropic
diffusion has been extended and applied to a wide range of problems (Nielsen, Florack, and 
Deriche 1997; Black, Sapiro et al. 1998; Weickert, ter Haar Romeny, and Viergever 1998; 
Weickert 1998). It has also been shown to be closely related to other adaptive smoothing 
techniques (Saint-Marc, Chen, and Medioni 1991; Barash 2002; Barash and Comaniciu 2004) 
as well as Bayesian regularization with a non-linear smoothness term that can be derived from 
image statistics (Scharr, Black, and Haussecker 2003). 

In its general form, the range kernel r(i, j, k, l) = r(\| f(i, j) - f(k, l) \|), which is usually
called the gain or edge-stopping function, or diffusion coefficient, can be any monotonically
increasing function with r'(x) \to 0 as x \to \infty. Black, Sapiro et al. (1998) show
how anisotropic diffusion is equivalent to minimizing a robust penalty function on the image 
gradients, which we discuss in Sections 4.2 and 4.3. Scharr, Black, and Haussecker (2003) 
show how the edge-stopping function can be derived in a principled manner from local image 
statistics. They also extend the diffusion neighborhood from N_4 to N_8, which allows them
to create a diffusion operator that is both rotationally invariant and incorporates information 
about the eigenvalues of the local structure tensor. 

Note that, without a bias term towards the original image, anisotropic diffusion and itera- 
tive adaptive smoothing converge to a constant image. Unless a small number of iterations is 
used (e.g., for speed), it is usually preferable to formulate the smoothing problem as a joint 
minimization of a smoothness term and a data fidelity term, as discussed in Sections 4.2 and 
4.3 and by Scharr, Black, and Haussecker (2003), which introduce such a bias in a principled 


manner. 


Guided image filtering 


While so far we have discussed techniques for filtering an image to obtain an improved ver- 
sion, e.g., one with less noise or sharper edges, it is also possible to use a different guide 
image to adaptively filter a noisy input (Eisemann and Durand 2004; Petschnigg, Agrawala 
et al. 2004; He, Sun, and Tang 2013). An example of this is using a flash image, which has 
strong edges but poor color, to adaptively filter a low-light non-flash color image, which has 
large amounts of noise, as described in Section 10.2.2. In their papers, where they apply the 
range filter (3.36) to a different guide image h(), Eisemann and Durand (2004) call their ap- 
proach a cross-bilateral filter, while Petschnigg, Agrawala et al. (2004) call it joint bilateral 
filtering. 

⁸ The 1/(1 + \eta R) factor is not present in anisotropic diffusion but becomes negligible as \eta \to 0.

He, Sun, and Tang (2013) point out that these papers are just two examples of the more
general concept of guided image filtering, where the guide image h() is used to compute the
locally adapted inter-pixel weights w(i, j, k, l), i.e.,

    g(i, j) = \sum_{k,l} w(h; i, j, k, l) f(k, l).    (3.41)

Figure 3.21 Guided image filtering (He, Sun, and Tang 2013) © 2013 IEEE. Unlike joint
bilateral filtering, shown on the left, which computes a per pixel weight mask from the guide
image (shown as I in the figure, but h in the text), the guided image filter models the output
value (shown as q_i in the figure, but denoted as g(i, j) in the text) as a local affine
transformation of the guide pixels.
k,l 

In their paper, the authors suggest modeling the relationship between the guide and input 
images using a local affine transformation, 


    g(i, j) = A_{k,l} h(i, j) + b_{k,l},    (3.42)

where the estimates for A_{k,l} and b_{k,l} are obtained from a regularized least squares fit over a
square neighborhood centered around pixel (k, l), i.e., minimizing

    \sum_{(i,j) \in N_{k,l}} \| A_{k,l} h(i, j) + b_{k,l} - f(i, j) \|^2 + \lambda \| A \|^2.    (3.43)
These kinds of regularized least squares problems are called ridge regression (Section 4.1). 
The concept behind this algorithm is illustrated in Figure 3.21. 
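
For a scalar guide, the fit (3.43) has a closed form, which the sketch below applies to one-dimensional signals. This is an illustration under stated assumptions rather than the authors' Algorithm 1: the function names are hypothetical, windows are truncated at the signal ends, sums are computed by brute force instead of with box filters, and each output sample uses the average of the coefficients from all windows covering it (equivalent to averaging the windows' predictions for that sample).

```python
def local_affine_fit(h, f, center, radius, lam):
    """Ridge-regression fit f ~ A * h + b over one window, minimizing (3.43)."""
    lo, hi = max(0, center - radius), min(len(h), center + radius + 1)
    hs, fs = h[lo:hi], f[lo:hi]
    mean_h, mean_f = sum(hs) / len(hs), sum(fs) / len(fs)
    s_hf = sum((a - mean_h) * (c - mean_f) for a, c in zip(hs, fs))
    s_hh = sum((a - mean_h) ** 2 for a in hs)
    A = s_hf / (s_hh + lam)               # lam > 0 keeps this finite on flat guides
    return A, mean_f - A * mean_h

def guided_filter(h, f, radius=2, lam=0.01):
    """Filter f using guide h: fit every window, then average the covering fits."""
    n = len(h)
    fits = [local_affine_fit(h, f, k, radius, lam) for k in range(n)]
    out = []
    for i in range(n):
        covering = fits[max(0, i - radius):min(n, i + radius + 1)]
        A = sum(a for a, _ in covering) / len(covering)
        b = sum(c for _, c in covering) / len(covering)
        out.append(A * h[i] + b)
    return out

guide = [0.0] * 5 + [1.0] * 5                              # clean step edge
noisy = [10.0, 12.0, 9.0, 11.0, 10.0, 100.0, 98.0, 101.0, 99.0, 100.0]
filtered = guided_filter(guide, noisy)
```

Where the guide is flat, A is close to zero and the output is a local mean of the input; where the guide has an edge, the affine model transfers that edge to the output.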

Instead of just taking the predicted value of the filtered pixel g(i, j) from the window cen- 
tered on that pixel, an average across all windows that cover the pixel is used. The resulting 
algorithm (He, Sun, and Tang 2013, Algorithm 1) consists of a series of local mean image and 
image moment filters, a per-pixel linear system solve (which reduces to a division if the guide 
image is scalar), and another set of filtering steps. The authors describe how this fast and 
simple process has been applied to a wide variety of computer vision problems, including 
image matting (Section 10.4.3), high dynamic range image tone mapping (Section 10.2.1), 
stereo matching (Hosni, Rhemann et al. 2013), and image denoising.

Figure 3.22 Binary image morphology: (a) original image; (b) dilation; (c) erosion; (d)
majority; (e) opening; (f) closing. The structuring element for all examples is a 5 x 5 square.
The effects of majority are a subtle rounding of sharp corners. Opening fails to eliminate the
dot, as it is not wide enough.


3.3.3 Binary image processing 


While non-linear filters are often used to enhance grayscale and color images, they are also 
used extensively to process binary images. Such images often occur after a thresholding 
operation, 
    \theta(f, t) = \begin{cases} 1 & \text{if } f \ge t, \\ 0 & \text{else,} \end{cases}    (3.44)
e.g., converting a scanned grayscale document into a binary image for further processing, 


such as optical character recognition. 


Morphology 


The most common binary image operations are called morphological operations, because 
they change the shape of the underlying binary objects (Ritter and Wilson 2000, Chapter 7). 
To perform such an operation, we first convolve the binary image with a binary structuring 
element and then select a binary output value depending on the thresholded result of the 
convolution. (This is not the usual way in which these operations are described, but I find it 
a nice simple way to unify the processes.) The structuring element can be any shape, from 
a simple 3 x 3 box filter, to more complicated disc structures. It can even correspond to a 
particular shape that is being sought for in the image. 

Figure 3.22 shows a close-up of the convolution of a binary image $f$ with a 3 × 3 structuring
element $s$ and the resulting images for the operations described below. Let

$$c = f \otimes s$$ (3.45)

be the integer-valued count of the number of 1s inside each structuring element as it is scanned
over the image and $S$ be the size of the structuring element (number of pixels). The standard
operations used in binary morphology include:

• dilation: $\text{dilate}(f, s) = \theta(c, 1)$;
• erosion: $\text{erode}(f, s) = \theta(c, S)$;
• majority: $\text{maj}(f, s) = \theta(c, S/2)$;
• opening: $\text{open}(f, s) = \text{dilate}(\text{erode}(f, s), s)$;
• closing: $\text{close}(f, s) = \text{erode}(\text{dilate}(f, s), s)$.


As we can see from Figure 3.22, dilation grows (thickens) objects consisting of 1s, while 
erosion shrinks (thins) them. The opening and closing operations tend to leave large regions 
and smooth boundaries unaffected, while removing small objects or holes and smoothing 
boundaries. 
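
The count-then-threshold view of morphology can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code used to produce Figure 3.22. The function names are my own, pixels outside the image are assumed to be 0, the structuring element is assumed to have odd size, and S is taken to be the number of 1s in the structuring element (which equals its pixel count for a box). The count is computed as a correlation, which equals the convolution for symmetric structuring elements.

```python
import numpy as np

def count_ones(f, s):
    """Count of 1s under the structuring element s at every pixel of f (c in 3.45)."""
    f = np.asarray(f, dtype=int)
    s = np.asarray(s, dtype=int)
    ph, pw = s.shape[0] // 2, s.shape[1] // 2
    # Assumption: pixels outside the image count as 0.
    padded = np.pad(f, ((ph, ph), (pw, pw)))
    c = np.zeros_like(f)
    for dy in range(s.shape[0]):
        for dx in range(s.shape[1]):
            if s[dy, dx]:
                c += padded[dy:dy + f.shape[0], dx:dx + f.shape[1]]
    return c

def theta(c, t):
    """Thresholding operation (3.44): 1 where c >= t, else 0."""
    return (np.asarray(c) >= t).astype(int)

def dilate(f, s):
    return theta(count_ones(f, s), 1)

def erode(f, s):
    return theta(count_ones(f, s), np.sum(s))

def majority(f, s):
    return theta(count_ones(f, s), np.sum(s) / 2)

def opening(f, s):
    return dilate(erode(f, s), s)

def closing(f, s):
    return erode(dilate(f, s), s)

box = np.ones((3, 3), dtype=int)
f = np.zeros((7, 7), dtype=int)
f[2:5, 2:5] = 1            # a 3 x 3 object
f[0, 6] = 1                # an isolated dot
cleaned = opening(f, box)  # the dot disappears, the object survives
```

In this toy example the opening removes the isolated dot because the 3 × 3 box never fits inside it, while the 3 × 3 object is eroded to a single pixel and then dilated back to its original extent.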

While we will not use mathematical morphology much in the rest of this book, it is a 
handy tool to have around whenever you need to clean up some thresholded images. You 
can find additional details on morphology in other textbooks on computer vision and image 
processing (Haralick and Shapiro 1992, Section 5.2; Bovik 2000, Section 2.2; Ritter and 
Wilson 2000, Section 7) as well as articles and books specifically on this topic (Serra 1982; 
Serra and Vincent 1992; Yuille, Vincent, and Geiger 1992; Soille 2006). 


Distance transforms 


The distance transform is useful in quickly precomputing the distance to a curve or set of 
points using a two-pass raster algorithm (Rosenfeld and Pfaltz 1966; Danielsson 1980; Borge- 
fors 1986; Paglieroni 1992; Breu, Gil et al. 1995; Felzenszwalb and Huttenlocher 2012; 
Fabbri, Costa et al. 2008). It has many applications, including level sets (Section 7.3.2), 
fast chamfer matching (binary image alignment) (Huttenlocher, Klanderman, and Rucklidge 
1993), feathering in image stitching and blending (Section 8.4.2), and nearest point alignment 
(Section 13.2.1). 

The distance transform $D(i, j)$ of a binary image $b(i, j)$ is defined as follows. Let $d(k, l)$
be some distance metric between pixel offsets. Two commonly used metrics include the city
block or Manhattan distance

$$d_1(k, l) = |k| + |l|$$ (3.46)

and the Euclidean distance

$$d_2(k, l) = \sqrt{k^2 + l^2}.$$ (3.47)

The distance transform is then defined as

$$D(i, j) = \min_{k,l:\, b(k,l)=0} d(i - k, j - l),$$ (3.48)

i.e., it is the distance to the nearest background pixel whose value is 0.

Figure 3.23 City block distance transform: (a) original binary image; (b) top to bottom
(forward) raster sweep: green values are used to compute the orange value; (c) bottom to top
(backward) raster sweep: green values are merged with old orange value; (d) final distance
transform.

The $D_1$ city block distance transform can be efficiently computed using a forward and
backward pass of a simple raster-scan algorithm, as shown in Figure 3.23. During the forward 
pass, each non-zero pixel in b is replaced by the minimum of 1 + the distance of its north or 
west neighbor. During the backward pass, the same occurs, except that the minimum is both 
over the current value D and 1 + the distance of the south and east neighbors (Figure 3.23). 
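
The two raster passes can be sketched as follows. This is a minimal, unoptimized illustration with a function name of my own choosing. It assumes that pixels outside the image are not background, so if the image contains no zero pixel, the returned values are just a large placeholder.

```python
import numpy as np

def city_block_distance_transform(b):
    """Two-pass city block (D_1) distance to the nearest zero pixel of b."""
    b = np.asarray(b)
    h, w = b.shape
    big = h + w  # exceeds any city block distance inside the image
    D = np.where(b != 0, big, 0)
    # Forward pass: propagate distances from the north and west neighbors.
    for i in range(h):
        for j in range(w):
            if D[i, j] == 0:
                continue
            if i > 0:
                D[i, j] = min(D[i, j], D[i - 1, j] + 1)
            if j > 0:
                D[i, j] = min(D[i, j], D[i, j - 1] + 1)
    # Backward pass: merge the current value with the south and east neighbors.
    for i in range(h - 1, -1, -1):
        for j in range(w - 1, -1, -1):
            if i < h - 1:
                D[i, j] = min(D[i, j], D[i + 1, j] + 1)
            if j < w - 1:
                D[i, j] = min(D[i, j], D[i, j + 1] + 1)
    return D

b = np.zeros((5, 7), dtype=int)
b[1:4, 1:6] = 1
D = city_block_distance_transform(b)
```

For this small rectangle, the middle row of `D` rises from the border towards the centre of the object, which is the ridge behaviour discussed later in this section.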

Efficiently computing the Euclidean distance transform is more complicated (Danielsson 
1980; Borgefors 1986). Here, just keeping the minimum scalar distance to the boundary 
during the two passes is not sufficient. Instead, a vector-valued distance consisting of both 
the x and y coordinates of the distance to the boundary must be kept and compared using the 
squared distance (hypotenuse) rule. As well, larger search regions need to be used to obtain 
reasonable results. 

Figure 3.11g shows a distance transform computed from a binary image. Notice how 
the values grow away from the black (ink) regions and form ridges in the white area of the 
original image. Because of this linear growth from the starting boundary pixels, the distance 
transform is also sometimes known as the grassfire transform, since it describes the time at 
which a fire starting inside the black region would consume any given pixel, or a chamfer, 
because it resembles similar shapes used in woodworking and industrial design. The ridges 
in the distance transform become the skeleton (or medial axis transform (MAT)) of the region 
where the transform is computed, and consist of pixels that are of equal distance to two (or 
more) boundaries (Tek and Kimia 2003; Sebastian and Kimia 2005). 

A useful extension of the basic distance transform is the signed distance transform, which 
computes distances to boundary pixels for all the pixels (Lavallée and Szeliski 1995). The
simplest way to create this is to compute the distance transforms for both the original binary
image and its complement and to negate one of them before combining. Because such distance
fields tend to be smooth, it is possible to store them more compactly (with minimal loss
in relative accuracy) using a spline defined over a quadtree or octree data structure (Lavallée
and Szeliski 1995; Szeliski and Lavallée 1996; Frisken, Perry et al. 2000). Such precomputed
signed distance transforms can be extremely useful in efficiently aligning and merging
2D curves and 3D surfaces (Huttenlocher, Klanderman, and Rucklidge 1993; Szeliski and
Lavallée 1996; Curless and Levoy 1996), especially if the vectorial version of the distance
transform, i.e., a pointer from each pixel or voxel to the nearest boundary or surface element,
is stored and interpolated. Signed distance fields are also an essential component of level set
evolution (Section 7.3.2), where they are called characteristic functions.


Connected components 


Another useful semi-global image operation is finding connected components, which are defined
as regions of adjacent pixels that have the same input value or label. Pixels are said
to be $N_4$ adjacent if they are immediately horizontally or vertically adjacent, and $N_8$ if they
can also be diagonally adjacent. Both variants of connected components are widely used in
a variety of applications, such as finding individual letters in a scanned document or finding 
objects (say, cells) in a thresholded image and computing their area statistics. Over the years, 
a wide variety of efficient algorithms have been developed to find such components, includ- 
ing the ones described in Haralick and Shapiro (1992, Section 2.3) and He, Ren et al. (2017). 
Such algorithms are usually included in image processing libraries such as OpenCV. 
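
To make the definition concrete, here is a simple breadth-first flood-fill labeling. It is only a sketch with a function name and interface of my own, and it is not one of the efficient algorithms cited above. Every pixel receives a label, so background regions are labeled as well.

```python
from collections import deque

import numpy as np

def connected_components(img, connectivity=4):
    """Label regions of adjacent pixels that share the same value."""
    img = np.asarray(img)
    h, w = img.shape
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if (dy, dx) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")
    labels = np.full((h, w), -1, dtype=int)
    count = 0
    for i in range(h):
        for j in range(w):
            if labels[i, j] != -1:
                continue
            # Start a new component and grow it breadth-first.
            labels[i, j] = count
            queue = deque([(i, j)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in offsets:
                    yy, xx = y + dy, x + dx
                    if (0 <= yy < h and 0 <= xx < w and labels[yy, xx] == -1
                            and img[yy, xx] == img[i, j]):
                        labels[yy, xx] = count
                        queue.append((yy, xx))
            count += 1
    return labels, count

checker = np.array([[1, 0], [0, 1]])
labels4, n4 = connected_components(checker, connectivity=4)
labels8, n8 = connected_components(checker, connectivity=8)
```

The 2 × 2 checkerboard shows the difference between the two adjacency rules: with $N_4$ adjacency every pixel is its own component, while with $N_8$ adjacency the diagonal pairs merge.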

Once a binary or multi-valued image has been segmented into its connected components,
it is often useful to compute the area statistics for each individual region $R$. Such statistics
include:

• the area (number of pixels);
• the perimeter (number of boundary pixels);
• the centroid (average $x$ and $y$ values);
• the second moments,

$$M = \sum_{(x,y) \in R} \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix} \begin{bmatrix} x - \bar{x} & y - \bar{y} \end{bmatrix},$$ (3.49)

  from which the major and minor axis orientation and lengths can be computed using
  eigenvalue analysis.

These statistics can then be used for further processing, e.g., for sorting the regions by the area
size (to consider the largest regions first) or for preliminary matching of regions in different
images.
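
The statistics listed above can be computed directly from a boolean region mask, as in the following sketch. The function name and return format are my own. The perimeter definition used here (region pixels with at least one 4-neighbor outside the region) is one of several possible conventions and is an assumption.

```python
import numpy as np

def region_statistics(mask):
    """Area, perimeter, centroid, and second moments (3.49) of a region mask."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    area = int(xs.size)
    centroid = np.array([xs.mean(), ys.mean()])  # (x_bar, y_bar)
    # Assumed perimeter definition: region pixels with a 4-neighbor outside the region.
    p = np.pad(mask, 1)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    perimeter = int(np.count_nonzero(mask & ~interior))
    # Second moment matrix: sum of outer products of centered coordinates.
    d = np.stack([xs - centroid[0], ys - centroid[1]])  # shape 2 x area
    M = d @ d.T
    evals, evecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return {
        "area": area,
        "perimeter": perimeter,
        "centroid": centroid,
        "M": M,
        "major_axis": evecs[:, 1],   # direction of largest spread
        "minor_axis": evecs[:, 0],
        "axis_std": np.sqrt(evals[::-1] / area),  # spread along major, minor axes
    }

mask = np.zeros((5, 9), dtype=bool)
mask[1:4, 1:8] = True  # a 7-wide, 3-high rectangle
stats = region_statistics(mask)
```

For the axis-aligned rectangle in the example, the major axis comes out horizontal (up to sign), as expected.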


3.4 Fourier transforms 


In Section 3.2, we mentioned that Fourier analysis could be used to analyze the frequency 
characteristics of various filters. In this section, we explain both how Fourier analysis lets 
us determine these characteristics (i.e., the frequency content of an image) and how using 
the Fast Fourier Transform (FFT) lets us perform large-kernel convolutions in time that is 
independent of the kernel’s size. More comprehensive introductions to Fourier transforms 
are provided by Bracewell (1986), Glassner (1995), Oppenheim and Schafer (1996), and 
Oppenheim, Schafer, and Buck (1999). 

How can we analyze what a given filter does to high, medium, and low frequencies? The
answer is to simply pass a sinusoid of known frequency through the filter and to observe by
how much it is attenuated. Let

$$s(x) = \sin(2\pi f x + \phi_i) = \sin(\omega x + \phi_i)$$ (3.50)

be the input sinusoid whose frequency is $f$, angular frequency is $\omega = 2\pi f$, and phase is $\phi_i$.
Note that in this section, we use the variables $x$ and $y$ to denote the spatial coordinates of an
image, rather than $i$ and $j$ as in the previous sections. This is both because the letters $i$ and $j$
are used for the imaginary number (the usage depends on whether you are reading complex
variables or electrical engineering literature) and because it is clearer how to distinguish the
horizontal ($x$) and vertical ($y$) components in frequency space. In this section, we use the
letter $j$ for the imaginary number, since that is the form more commonly found in the signal
processing literature (Bracewell 1986; Oppenheim and Schafer 1996; Oppenheim, Schafer,
and Buck 1999).

If we convolve the sinusoidal signal $s(x)$ with a filter whose impulse response is $h(x)$,
we get another sinusoid of the same frequency but different magnitude $A$ and phase $\phi_o$,

$$o(x) = h(x) * s(x) = A \sin(\omega x + \phi_o),$$ (3.51)

as shown in Figure 3.24. To see that this is the case, remember that a convolution can be
expressed as a weighted summation of shifted input signals (3.14) and that the summation of
a bunch of shifted sinusoids of the same frequency is just a single sinusoid at that frequency.⁹


⁹ If $h$ is a general (non-linear) transform, additional harmonic frequencies are introduced. This was traditionally
the bane of audiophiles, who insisted on equipment with no harmonic distortion. Now that digital audio has
introduced pure distortion-free sound, some audiophiles are buying retro tube amplifiers or digital signal processors that
simulate such distortions because of their “warmer sound”.

Figure 3.24 The Fourier Transform as the response of a filter $h(x)$ to an input sinusoid
$s(x) = e^{j\omega x}$ yielding an output sinusoid $o(x) = h(x) * s(x) = A e^{j(\omega x + \phi)}$.

The new magnitude $A$ is called the gain or magnitude of the filter, while the phase difference
$\Delta\phi = \phi_o - \phi_i$ is called the shift or phase.

In fact, a more compact notation is to use the complex-valued sinusoid

$$s(x) = e^{j\omega x} = \cos \omega x + j \sin \omega x.$$ (3.52)

In that case, we can simply write,

$$o(x) = h(x) * s(x) = A e^{j(\omega x + \phi)}.$$ (3.53)

The Fourier transform is simply a tabulation of the magnitude and phase response at each
frequency,

$$H(\omega) = \mathcal{F}\{h(x)\} = A e^{j\phi},$$ (3.54)

i.e., it is the response to a complex sinusoid of frequency $\omega$ passed through the filter $h(x)$.
The Fourier transform pair is also often written as

$$h(x) \overset{\mathcal{F}}{\leftrightarrow} H(\omega).$$ (3.55)

Unfortunately, (3.54) does not give an actual formula for computing the Fourier transform.
Instead, it gives a recipe, i.e., convolve the filter with a sinusoid, observe the magnitude and
phase shift, repeat. Fortunately, closed form equations for the Fourier transform exist both in
the continuous domain,

$$H(\omega) = \int_{-\infty}^{\infty} h(x) e^{-j\omega x}\, dx,$$ (3.56)

and in the discrete domain,

$$H(k) = \frac{1}{N} \sum_{x=0}^{N-1} h(x) e^{-j \frac{2\pi k x}{N}},$$ (3.57)

where $N$ is the length of the signal or region of analysis. These formulas apply both to filters,
such as $h(x)$, and to signals or images, such as $s(x)$ or $g(x)$.
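
The "recipe" view and the closed-form view can be compared numerically. In the sketch below, a complex sinusoid is passed through a small symmetric filter, and the output is checked against the input scaled by a single complex number. The filter, the frequency, and the helper name are illustrative choices of mine, and the response is evaluated with the unnormalized sum $\sum_k h(k) e^{-j\omega k}$ rather than with the $1/N$ normalization of (3.57).

```python
import numpy as np

def frequency_response(h, offsets, omega):
    """H(omega) = sum_k h(k) exp(-j omega k) for a finite filter h."""
    return np.sum(h * np.exp(-1j * omega * offsets))

h = np.array([1.0, 2.0, 1.0]) / 4.0   # "linear" (tent) filter taps
offsets = np.array([-1, 0, 1])         # positions k of the taps
omega = 0.3 * np.pi                    # an arbitrary test frequency

x = np.arange(-20, 21)
s = np.exp(1j * omega * x)             # complex sinusoid input

# Convolution as a weighted sum of shifted inputs: o(x) = sum_k h(k) s(x - k).
o = sum(hk * np.exp(1j * omega * (x - k)) for hk, k in zip(h, offsets))

H = frequency_response(h, offsets, omega)
A, phi = np.abs(H), np.angle(H)        # gain and phase shift at this frequency
```

Here `o` equals `H * s` at every sample, so the filter changes only the magnitude and phase of the sinusoid. Because the filter is symmetric, the phase shift is zero.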
The discrete form of the Fourier transform (3.57) is known as the Discrete Fourier Transform
(DFT). Note that while (3.57) can be evaluated for any value of $k$, it only makes sense
for values in the range $k \in [-\frac{N}{2}, \frac{N}{2}]$. This is because larger values of $k$ alias with lower
frequencies and hence provide no additional information, as explained in the discussion on
aliasing in Section 2.3.1.

At face value, the DFT takes $O(N^2)$ operations (multiply-adds) to evaluate. Fortunately,
there exists a faster algorithm called the Fast Fourier Transform (FFT), which requires only
$O(N \log_2 N)$ operations (Bracewell 1986; Oppenheim, Schafer, and Buck 1999). We do
not explain the details of the algorithm here, except to say that it involves a series of $\log_2 N$
stages, where each stage performs small 2 x 2 transforms (matrix multiplications with known 
coefficients) followed by some semi-global permutations. (You will often see the term but- 
terfly applied to these stages because of the pictorial shape of the signal processing graphs 
involved.) Implementations for the FFT can be found in most numerical and signal processing 
libraries. 
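
As a quick check, the direct $O(N^2)$ evaluation of (3.57) can be compared against a library FFT. The sketch below uses NumPy, whose forward transform `np.fft.fft` omits the $1/N$ factor by default, so the result is divided by $N$ to match (3.57). The function names `dft` and `dft_coefficient` are my own.

```python
import numpy as np

def dft(h):
    """Direct O(N^2) evaluation of (3.57), including the 1/N normalization."""
    h = np.asarray(h, dtype=complex)
    N = h.size
    x = np.arange(N)
    k = x.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * x / N)   # N x N matrix of complex sinusoids
    return (W @ h) / N

def dft_coefficient(h, k):
    """A single coefficient H(k) of (3.57), valid for any integer k."""
    h = np.asarray(h, dtype=complex)
    N = h.size
    x = np.arange(N)
    return np.sum(h * np.exp(-2j * np.pi * k * x / N)) / N

rng = np.random.default_rng(0)
h = rng.standard_normal(16)
H_direct = dft(h)
H_fft = np.fft.fft(h) / h.size   # NumPy's forward FFT has no 1/N factor by default
```

Evaluating `dft_coefficient(h, 3)` and `dft_coefficient(h, 3 + 16)` gives the same value, which illustrates the aliasing of larger values of $k$ mentioned above.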

The Fourier transform comes with a set of extremely useful properties relating original 
signals and their Fourier transforms, including superposition, shifting, reversal, convolution, 
correlation, multiplication, differentiation, domain scaling (stretching), and energy preserva- 
tion (Parseval’s Theorem). To make room for all of the new material in this second edition, 
I have removed all of these details, as well as a discussion of commonly used Fourier trans- 
form pairs. Interested readers should refer to (Szeliski 2010, Section 3.1, Tables 3.1-3.3) or 
standard textbooks on signal processing and Fourier transforms (Bracewell 1986; Glassner 
1995; Oppenheim and Schafer 1996; Oppenheim, Schafer, and Buck 1999). 

We can also compute the Fourier transforms for the small discrete kernels shown in Figure 3.14
(see Table 3.1). Notice how the moving average filters do not uniformly dampen
higher frequencies and hence can lead to ringing artifacts. The binomial filter (Gomes and
Velho 1997) used as the “Gaussian” in Burt and Adelson’s (1983a) Laplacian pyramid (see
Section 3.5), does a decent job of separating the high and low frequencies, but still leaves
a fair amount of high-frequency detail, which can lead to aliasing after downsampling. The
Sobel edge detector at first linearly accentuates frequencies, but then decays at higher
frequencies, and hence has trouble detecting fine-scale edges, e.g., adjacent black and white
columns. We look at additional examples of small kernel Fourier transforms in Section 3.5.2,
where we study better kernels for prefiltering before decimation (size reduction).

Name        Kernel                              Transform
box-3       $\frac{1}{3}[1, 1, 1]$              $\frac{1}{3}(1 + 2\cos\omega)$
box-5       $\frac{1}{5}[1, 1, 1, 1, 1]$        $\frac{1}{5}(1 + 2\cos\omega + 2\cos 2\omega)$
linear      $\frac{1}{4}[1, 2, 1]$              $\frac{1}{2}(1 + \cos\omega)$
binomial    $\frac{1}{16}[1, 4, 6, 4, 1]$       $\frac{1}{4}(1 + \cos\omega)^2$
Sobel       $\frac{1}{2}[-1, 0, 1]$             $\sin\omega$
corner      $\frac{1}{2}[-1, 2, -1]$            $\frac{1}{2}(1 - \cos\omega)$

Table 3.1 Fourier transforms of the separable kernels shown in Figure 3.14, obtained by
evaluating $\sum_k h(k) e^{-jk\omega}$.
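
The closed-form entries for the smoothing kernels can be checked by evaluating the sum $\sum_k h(k) e^{-jk\omega}$ numerically. The sketch below does this for the box, linear, and binomial kernels only. The helper name is my own, and the kernels are assumed to be centered on $k = 0$.

```python
import numpy as np

def kernel_transform(h, omega):
    """Evaluate sum_k h(k) exp(-j k omega) for an odd-length kernel centered at k = 0."""
    h = np.asarray(h, dtype=float)
    r = (h.size - 1) // 2
    k = np.arange(-r, r + 1)
    return np.exp(-1j * np.outer(omega, k)) @ h

omega = np.linspace(0.0, np.pi, 65)
box3 = kernel_transform(np.ones(3) / 3, omega)
box5 = kernel_transform(np.ones(5) / 5, omega)
linear = kernel_transform(np.array([1, 2, 1]) / 4, omega)
binomial = kernel_transform(np.array([1, 4, 6, 4, 1]) / 16, omega)

# Responses at the highest frequency (omega = pi).
box3_at_pi = box3[-1].real          # negative lobe: the box does not dampen uniformly
binomial_at_pi = binomial[-1].real  # the binomial response falls to zero
```

At $\omega = \pi$ the box-3 response is $-1/3$ rather than zero, which is the non-uniform damping noted above, while the binomial response reaches zero.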


3.4.1 Two-dimensional Fourier transforms 


The formulas and insights we have developed for one-dimensional signals and their transforms
translate directly to two-dimensional images. Here, instead of just specifying a horizontal
or vertical frequency $\omega_x$ or $\omega_y$, we can create an oriented sinusoid of frequency
$(\omega_x, \omega_y)$,

$$s(x, y) = \sin(\omega_x x + \omega_y y).$$ (3.58)

The corresponding two-dimensional Fourier transforms are then

$$H(\omega_x, \omega_y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) e^{-j(\omega_x x + \omega_y y)}\, dx\, dy,$$ (3.59)

and in the discrete domain,

$$H(k_x, k_y) = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} h(x, y) e^{-j 2\pi (k_x x / M + k_y y / N)},$$ (3.60)

where $M$ and $N$ are the width and height of the image.

All of the Fourier transform properties from 1D carry over to two dimensions if we replace
the scalar variables $x$, $\omega$, $x_0$, and $a$ with their 2D vector counterparts $\mathbf{x} = (x, y)$,
$\boldsymbol{\omega} = (\omega_x, \omega_y)$, $\mathbf{x}_0 = (x_0, y_0)$, and $\mathbf{a} = (a_x, a_y)$, and use vector inner products instead of
multiplications.


Wiener filtering 


While the Fourier transform is a useful tool for analyzing the frequency characteristics of a 
filter kernel or image, it can also be used to analyze the frequency spectrum of a whole class 
of images. 

A simple model for images is to assume that they are random noise fields whose expected
magnitude at each frequency is given by this power spectrum $P_s(\omega_x, \omega_y)$, i.e.,

$$\left\langle [S(\omega_x, \omega_y)]^2 \right\rangle = P_s(\omega_x, \omega_y),$$ (3.61)

where the angle brackets $\langle \cdot \rangle$ denote the expected (mean) value of a random variable.¹⁰ To
generate such an image, we simply create a random Gaussian noise image $S(\omega_x, \omega_y)$ where
each “pixel” is a zero-mean Gaussian of variance $P_s(\omega_x, \omega_y)$ and then take its inverse FFT.

¹⁰ The notation $E[\cdot]$ is also commonly used.

Figure 3.25 Discrete cosine transform (DCT) basis functions: The first DC (i.e., constant)
basis is the horizontal blue line, the second is the brown half-cycle waveform, etc. These
bases are widely used in image and video compression standards such as JPEG.


The observation that signal spectra capture a first-order description of spatial statistics 
is widely used in signal and image processing. In particular, assuming that an image is a 
sample from a correlated Gaussian random noise field combined with a statistical model of 
the measurement process yields an optimum restoration filter known as the Wiener filter. 

The first edition of this book contains a derivation of the Wiener filter (Szeliski 2010, 
Section 3.4.3), but I’ve decided to remove this from the current edition, since it is almost 
never used in practice any more, having been replaced with better-performing non-linear 
filters. 


Discrete cosine transform 


The discrete cosine transform (DCT) is a variant of the Fourier transform particularly well- 
suited to compressing images in a block-wise fashion. The one-dimensional DCT is com- 
puted by taking the dot product of each N-wide block of pixels with a set of cosines of 
different frequencies, 

$$F(k) = \sum_{i=0}^{N-1} \cos\left(\frac{\pi}{N}\left(i + \frac{1}{2}\right) k\right) f(i),$$ (3.62)

where k is the coefficient (frequency) index and the 1/2-pixel offset is used to make the basis 
coefficients symmetric (Wallace 1991). Some of the discrete cosine basis functions are shown 
in Figure 3.25. As you can see, the first basis function (the straight blue line) encodes the 
average DC value in the block of pixels, while the second encodes a slightly curvy version of 
the slope. 
It turns out that the DCT is a good approximation to the optimal Karhunen–Loève decomposition
of natural image statistics over small patches, which can be obtained by performing
a principal component analysis (PCA) of images, as described in Section 5.2.3. The KL-transform
decorrelates the signal optimally (assuming the signal is described by its spectrum)
and thus, theoretically, leads to optimal compression.


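The two-dimensional version of the DCT is defined similarly,

$$F(k, l) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \cos\left(\frac{\pi}{N}\left(i + \frac{1}{2}\right) k\right) \cos\left(\frac{\pi}{N}\left(j + \frac{1}{2}\right) l\right) f(i, j).$$ (3.63)
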
Like the 2D Fast Fourier Transform, the 2D DCT can be implemented separably, i.e., first 
computing the DCT of each line in the block and then computing the DCT of each resulting 
column. Like the FFT, each of the DCTs can also be computed in O(N log N) time. 
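
The definitions (3.62) and (3.63), and the separability of the 2D transform, can be illustrated with plain matrix products. This sketch uses the unnormalized form given above and direct $O(N^2)$ products rather than a fast algorithm. The function names are my own.

```python
import numpy as np

def dct_basis(N):
    """Matrix C with C[k, i] = cos(pi / N * (i + 1/2) * k), as in (3.62)."""
    k = np.arange(N).reshape(-1, 1)
    i = np.arange(N).reshape(1, -1)
    return np.cos(np.pi / N * (i + 0.5) * k)

def dct1d(f):
    """Unnormalized 1D DCT of an N-sample block, per (3.62)."""
    f = np.asarray(f, dtype=float)
    return dct_basis(f.size) @ f

def dct2d(f):
    """Unnormalized 2D DCT of an N x N block, per (3.63), computed separably."""
    f = np.asarray(f, dtype=float)
    C = dct_basis(f.shape[0])
    # Transform along the first index, then along the second index.
    return C @ f @ C.T

flat = dct1d(np.ones(8))       # a constant block has only a DC coefficient
rng = np.random.default_rng(1)
block = rng.standard_normal((4, 4))
F = dct2d(block)
```

For the constant block, only `flat[0]` is non-zero, which matches the description of the first basis function as encoding the average (DC) value.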

As we mentioned in Section 2.3.3, the DCT is widely used in today’s image and video 
compression algorithms, although alternatives such as wavelet transforms (Simoncelli and 
Adelson 1990b; Taubman and Marcellin 2002), discussed in Section 3.5.4, and overlapped 
variants of the DCT (Malvar 1990, 1998, 2000), are used in the JPEG2000 and JPEG XR stan- 
dards. These newer algorithms suffer less from the blocking artifacts (visible edge-aligned 
discontinuities) that result from the pixels in each block (typically 8 x 8) being transformed 
and quantized independently. See Exercise 4.3 for ideas on how to remove blocking artifacts
from compressed JPEG images.


3.4.2 Application: Sharpening, blur, and noise removal 


Another common application of image processing is the enhancement of images through the 
use of sharpening and noise removal operations, which require some kind of neighborhood 
processing. Traditionally, these kinds of operations were performed using linear filtering (see
Sections 3.2 and 3.4.1). Today, it is more common to use non-linear filters (Section 3.3.1),
such as the weighted median or bilateral filter (3.34-3.37), anisotropic diffusion
(3.39-3.40), or non-local means (Buades, Coll, and Morel 2008). Variational methods
(Section 4.2), especially those using non-quadratic (robust) norms such as the $L_1$ norm (which is
called total variation), are also often used. Most recently, deep neural networks have taken
over the denoising community (Section 10.3). Figure 3.19 shows some examples of linear 
and non-linear filters being used to remove noise. 

When measuring the effectiveness of image denoising algorithms, it is common to report
the results as a peak signal-to-noise ratio (PSNR) measurement (2.120), where $I(x)$ is the
original (noise-free) image and $\hat{I}(x)$ is the image after denoising; this is for the case where
the noisy image has been synthetically generated, so that the clean image is known. A better
way to measure the quality is to use a perceptually based similarity metric, such as the
structural similarity (SSIM) index (Wang, Bovik et al. 2004; Wang, Bovik, and Simoncelli
2005) or FLIP image difference evaluator (Andersson, Nilsson et al. 2020). More recently,
people have started measuring similarity using neural “perceptual” similarity metrics (Johnson,
Alahi, and Fei-Fei 2016; Dosovitskiy and Brox 2016; Zhang, Isola et al. 2018; Tariq,
Tursun et al. 2020; Czolbe, Krause et al. 2020). Unlike $L_2$ (PSNR) or $L_1$ metrics,
which encourage smooth or flat average results, these metrics prefer images with similar
amounts of texture (Cho, Joshi et al. 2012). When the clean image is not available, it is also
possible to assess the quality of an image using no-reference image quality assessment (Mittal,
Moorthy, and Bovik 2012; Talebi and Milanfar 2018).
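
For reference, a PSNR computation can be sketched as follows. Equation (2.120) is not reproduced in this section, so the code assumes the usual definition $10 \log_{10}(I_{\max}^2 / \text{MSE})$. The function name and the default peak value of 255 (8-bit images) are also assumptions.

```python
import numpy as np

def psnr(clean, denoised, i_max=255.0):
    """Peak signal-to-noise ratio in dB, assuming PSNR = 10 log10(i_max^2 / MSE)."""
    clean = np.asarray(clean, dtype=float)
    denoised = np.asarray(denoised, dtype=float)
    mse = np.mean((clean - denoised) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(i_max ** 2 / mse)

clean = np.full((4, 4), 100.0)
off_by_one = clean + 1.0          # every pixel wrong by one gray level
score = psnr(clean, off_by_one)   # about 48.13 dB for 8-bit images
```

Because the measure depends only on the mean squared error, it cannot distinguish between errors that are visually objectionable and errors that are not, which is one motivation for the perceptual metrics mentioned above.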

Exercises 3.12, 3.21, and 3.28 have you implement some of these operations and compare 
their effectiveness. More sophisticated techniques for blur removal and the related task of
super-resolution are discussed in Section 10.3.


3.5 Pyramids and wavelets 


So far in this chapter, all of the image transformations we have studied produce output images 
of the same size as the inputs. Often, however, we may wish to change the resolution of an 
image before proceeding further. For example, we may need to interpolate a small image to 
make its resolution match that of the output printer or computer screen. Alternatively, we 
may want to reduce the size of an image to speed up the execution of an algorithm or to save 
on storage space or transmission time. 

Sometimes, we do not even know what the appropriate resolution for the image should 
be. Consider, for example, the task of finding a face in an image (Section 6.3.1). Since we 
do not know the scale at which the face will appear, we need to generate a whole pyramid 
of differently sized images and scan each one for possible faces. (Biological visual systems 
also operate on a hierarchy of scales (Marr 1982).) Such a pyramid can also be very helpful 
in accelerating the search for an object by first finding a smaller instance of that object at a 
coarser level of the pyramid and then looking for the full resolution object only in the vicinity 
of coarse-level detections (Section 9.1.1). Finally, image pyramids are extremely useful for 
performing multi-scale editing operations such as blending images while maintaining details. 

In this section, we first discuss good filters for changing image resolution, i.e., upsampling 
(interpolation, Section 3.5.1) and downsampling (decimation, Section 3.5.2). We then present 
the concept of multi-resolution pyramids, which can be used to create a complete hierarchy 
of differently sized images and to enable a variety of applications (Section 3.5.3). A closely 
related concept is that of wavelets, which are a special kind of pyramid with higher frequency 
selectivity and other useful properties (Section 3.5.4). Finally, we present a useful application 
of pyramids, namely the blending of different images in a way that hides the seams between
the image boundaries (Section 3.5.5).

Figure 3.26 Signal interpolation, $g(i) = \sum_k f(k) h(i - rk)$: (a) weighted summation of
input values; (b) polyphase filter interpretation.


3.5.1 Interpolation 


In order to interpolate (or upsample) an image to a higher resolution, we need to select some
interpolation kernel with which to convolve the image,

$$g(i, j) = \sum_{k,l} f(k, l) h(i - rk, j - rl).$$ (3.64)

This formula is related to the discrete convolution formula (3.14), except that we replace $k$
and $l$ in $h()$ with $rk$ and $rl$, where $r$ is the upsampling rate. Figure 3.26a shows how to think
of this process as the superposition of sample weighted interpolation kernels, one centered
at each input sample $k$. An alternative mental model is shown in Figure 3.26b, where the
kernel is centered at the output pixel value $i$ (the two forms are equivalent). The latter form
is sometimes called the polyphase filter form, since the kernel values $h(i)$ can be stored as $r$
separate kernels, each of which is selected for convolution with the input samples depending
on the phase of $i$ relative to the upsampled grid.
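
A one-dimensional version of (3.64) can be sketched as the superposition of sample-weighted kernels (the view of Figure 3.26a). The tent kernel, the output length, and the function names below are my own illustrative choices. The kernel is stretched so that it is defined on the upsampled grid.

```python
import numpy as np

def tent(x):
    """Linear (tent) interpolation kernel with unit spacing."""
    return np.maximum(0.0, 1.0 - np.abs(x))

def upsample(f, r, kernel=tent):
    """1D interpolation g(i) = sum_k f(k) h(i - r k), with h(i) = kernel(i / r)."""
    f = np.asarray(f, dtype=float)
    n_out = r * (f.size - 1) + 1       # assumed output length: first to last sample
    i = np.arange(n_out)
    g = np.zeros(n_out)
    for k, fk in enumerate(f):
        # Add one kernel, centered at output position r * k and weighted by f(k).
        g += fk * kernel((i - r * k) / r)
    return g

g = upsample([0.0, 2.0, 4.0], r=2)     # gives 0, 1, 2, 3, 4
```

With $r = 2$ and the tent kernel, the even output samples copy the input samples, while the odd output samples average their two neighbors. These are the two phases of the polyphase form.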

What kinds of kernel make good interpolators? The answer depends on the application
and the computation time involved. Any of the smoothing kernels shown in Table 3.1 can be
used after appropriate re-scaling.¹¹ The linear interpolator (corresponding to the tent kernel)
produces interpolating piecewise linear curves, which result in unappealing creases when
applied to images (Figure 3.27a). The cubic B-spline, whose discrete 1/2-pixel sampling
appears as the binomial kernel in Table 3.1, is an approximating kernel (the interpolated
image does not pass through the input data points) that produces soft images with reduced
high-frequency detail. The equation for the cubic B-spline is easiest to derive by convolving
the tent function (linear B-spline) with itself.

¹¹ The smoothing kernels in Table 3.1 have a unit area. To turn them into interpolating kernels, we simply scale
them up by the interpolation rate $r$.

Figure 3.27 Two-dimensional image interpolation: (a) bilinear; (b) bicubic ($a = -1$); (c)
bicubic ($a = -0.5$); (d) windowed sinc (nine taps).

While most graphics cards use the bilinear kernel (optionally combined with a MIP-map;
see Section 3.5.3), most photo editing packages use bicubic interpolation. The cubic
interpolant is a $C^1$ (derivative-continuous) piecewise-cubic spline (the term “spline” is
synonymous with “piecewise-polynomial”)¹² whose equation is

$$h(x) = \begin{cases} 1 - (a + 3)x^2 + (a + 2)|x|^3 & \text{if } |x| < 1 \\ a(|x| - 1)(|x| - 2)^2 & \text{if } 1 \le |x| < 2 \\ 0 & \text{otherwise,} \end{cases}$$ (3.65)

where $a$ specifies the derivative at $x = 1$ (Parker, Kenyon, and Troxel 1983). The value of
$a$ is often set to $-1$, since this best matches the frequency characteristics of a sinc function
(Figure 3.28). It also introduces a small amount of sharpening, which can be visually appealing.
Unfortunately, this choice does not linearly interpolate straight lines (intensity ramps),
so some visible ringing may occur. A better choice for large amounts of interpolation is probably
$a = -0.5$, which produces a quadratic reproducing spline; it interpolates linear and
quadratic functions exactly (Wolberg 1990, Section 5.4.3). Figure 3.28 shows the $a = -1$
and $a = -0.5$ cubic interpolating kernels along with their Fourier transforms; Figure 3.27b
and c shows them being applied to two-dimensional interpolation.

¹² The term “spline” comes from the draughtsman’s workshop, where it was the name of a flexible piece of wood
or metal used to draw smooth curves.

152 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021)

Figure 3.28 (a) Some windowed sinc functions and (b) their log Fourier transforms:
raised-cosine windowed sinc in blue, cubic interpolators ($a = -1$ and $a = -0.5$) in green
and purple, and tent function in brown. They are often used to perform high-accuracy low-pass
filtering operations.
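
The behavior of the two parameter choices on an intensity ramp can be checked with a direct implementation of (3.65). The function names and the test signal below are my own, and the interpolation point is chosen away from the ends of the signal so that all four contributing samples exist.

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    """Piecewise-cubic interpolation kernel of (3.65)."""
    x = np.abs(np.asarray(x, dtype=float))
    near = 1.0 - (a + 3.0) * x**2 + (a + 2.0) * x**3   # |x| < 1
    far = a * (x - 1.0) * (x - 2.0) ** 2               # 1 <= |x| < 2
    return np.where(x < 1.0, near, np.where(x < 2.0, far, 0.0))

def interpolate(f, t, a=-0.5):
    """Value at position t of the cubic interpolant of samples f at 0, 1, 2, ..."""
    f = np.asarray(f, dtype=float)
    k = np.arange(f.size)
    return float(np.sum(f * cubic_kernel(t - k, a)))

ramp = np.arange(8.0)                          # a straight intensity ramp
exact = interpolate(ramp, 3.25, a=-0.5)        # reproduces the ramp value 3.25
sharpened = interpolate(ramp, 3.25, a=-1.0)    # overshoots the ramp slightly
```

In this example the $a = -0.5$ kernel returns the ramp value exactly, while the $a = -1$ kernel does not, which is consistent with the remark above about straight lines.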

Splines have long been used for function and data value interpolation because of the abil- 
ity to precisely specify derivatives at control points and efficient incremental algorithms for 
their evaluation (Bartels, Beatty, and Barsky 1987; Farin 1992, 2002). Splines are widely used 
in geometric modeling and computer-aided design (CAD) applications, although they have 
started being displaced by subdivision surfaces (Zorin, Schröder, and Sweldens 1996; Peters
and Reif 2008). In computer vision, splines are often used for elastic image deformations 
(Section 3.6.2), scattered data interpolation (Section 4.1), motion estimation (Section 9.2.2), 
and surface interpolation (Section 13.3). In fact, it is possible to carry out most image processing
operations by representing images as splines and manipulating them in a multi-resolution
framework (Unser 1999; Nehab and Hoppe 2014). 

The highest quality interpolator is generally believed to be the windowed sinc function
because it both preserves details in the lower resolution image and avoids aliasing. (It is also
possible to construct a $C^1$ piecewise-cubic approximation to the windowed sinc by matching
its derivatives at zero crossings (Szeliski and Ito 1986).) However, some people object to the
excessive ringing that can be introduced by the windowed sinc and to the repetitive nature
of the ringing frequencies (see Figure 3.27d). For this reason, some photographers prefer
to repeatedly interpolate images by a small fractional amount (this tends to decorrelate the
original pixel grid with the final image). Additional possibilities include using the bilateral
filter as an interpolator (Kopf, Cohen et al. 2007), using global optimization (Section 3.6), or
hallucinating details (Section 10.3).

Figure 3.29 Signal decimation: (a) the original samples are (b) convolved with a low-pass
filter before being downsampled.


3.5.2 Decimation 


While interpolation can be used to increase the resolution of an image, decimation (downsampling)
is required to reduce the resolution.¹³ To perform decimation, we first (conceptually)
convolve the image with a low-pass filter (to avoid aliasing) and then keep every $r$th sample.
In practice, we usually only evaluate the convolution at every $r$th sample,

$$g(i, j) = \sum_{k,l} f(k, l) h(ri - k, rj - l),$$ (3.66)

as shown in Figure 3.29. Note that the smoothing kernel $h(k, l)$, in this case, is often a
stretched and re-scaled version of an interpolation kernel. Alternatively, we can write

$$g(i, j) = \frac{1}{r} \sum_{k,l} f(k, l) h(i - k/r, j - l/r)$$ (3.67)

and keep the same kernel $h(k, l)$ for both interpolation and decimation.

One commonly used (r = 2) decimation filter is the binomial filter introduced by Burt 
and Adelson (1983a). As shown in Table 3.1, this kernel does a decent job of separating 
the high and low frequencies, but still leaves a fair amount of high-frequency detail, which 
can lead to aliasing after downsampling. However, for applications such as image blending 
(discussed later in this section), this aliasing is of little concern. 
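
A one-dimensional version of (3.66) with the binomial filter can be sketched as follows. The convolution is evaluated only at every $r$th sample. The function name is my own, and replicating the edge samples at the signal boundaries is an assumption (other boundary treatments are possible).

```python
import numpy as np

def decimate(f, h, r=2):
    """1D decimation g(i) = sum_k f(k) h(r i - k) for an odd-length kernel h."""
    f = np.asarray(f, dtype=float)
    h = np.asarray(h, dtype=float)
    half = (h.size - 1) // 2
    padded = np.pad(f, half, mode="edge")   # assumption: replicate edge samples
    n_out = (f.size + r - 1) // r
    g = np.empty(n_out)
    for i in range(n_out):
        c = r * i + half                     # position of sample r*i in padded signal
        window = padded[c - half:c + half + 1]
        g[i] = np.dot(window, h[::-1])       # reversed kernel gives true convolution
    return g

binomial = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

ramp_half = decimate(np.arange(8.0), binomial)         # interior values 2 and 4
nyquist_half = decimate((-1.0) ** np.arange(8), binomial)
```

Away from the boundaries, the ramp is reproduced at half the sampling rate, and the highest-frequency (alternating) signal is removed entirely, since the binomial response is zero at $\omega = \pi$.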

If, however, the downsampled images will be displayed directly to the user or, perhaps,
blended with other resolutions (as in MIP-mapping, Section 3.5.3), a higher-quality filter is
desired. For high downsampling rates, the windowed sinc prefilter is a good choice
(Figure 3.28). However, for small downsampling rates, e.g., $r = 2$, more careful filter design is
required.

¹³ The term “decimation” has a gruesome etymology relating to the practice of killing every tenth soldier in a
Roman unit guilty of cowardice. It is generally used in signal processing to mean any downsampling or rate reduction
operation.

Table 3.2 shows a number of commonly used r = 2 downsampling filters, while Figure 3.30
shows their corresponding frequency responses. These filters include:

• the linear [1, 2, 1] filter gives a relatively poor response;

• the binomial [1, 4, 6, 4, 1] filter cuts off a lot of frequencies but is useful for computer
  vision analysis pyramids;

• the cubic filters from (3.65); the a = −1 filter has a sharper fall-off than the a = −0.5
  filter (Figure 3.30);

• a cosine-windowed sinc function;

• the QMF-9 filter of Simoncelli and Adelson (1990b) is used for wavelet denoising and
  aliases a fair amount (note that the original filter coefficients are normalized to √2 gain
  so they can be “self-inverting”);

• the 9/7 analysis filter from JPEG 2000 (Taubman and Marcellin 2002).


Please see the original papers for the full-precision values of some of these coefficients. 
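As a quick sanity check on the coefficients in Table 3.2, the following sketch mirrors each one-sided coefficient list into a full symmetric kernel and evaluates its DC gain and its response at a few frequencies. The helper names `full_kernel` and `response` are hypothetical, and the values are the rounded ones from the table, so the DC gains are only approximately 1.

```python
import numpy as np

# One-sided coefficients h[0], h[1], ... copied from Table 3.2 (rounded values).
HALF_FILTERS = {
    "linear":        [0.50, 0.25],
    "binomial":      [0.3750, 0.2500, 0.0625],
    "cubic a=-1":    [0.5000, 0.3125, 0.0000, -0.0625],
    "cubic a=-0.5":  [0.50000, 0.28125, 0.00000, -0.03125],
    "windowed sinc": [0.4939, 0.2684, 0.0000, -0.0153, 0.0000],
    "QMF-9":         [0.5638, 0.2932, -0.0519, -0.0431, 0.0198],
    "JPEG 2000":     [0.6029, 0.2669, -0.0782, -0.0169, 0.0267],
}

def full_kernel(half):
    """Mirror the one-sided coefficients into a symmetric odd-length kernel."""
    half = np.asarray(half, dtype=float)
    return np.concatenate([half[:0:-1], half])

def response(half, omega):
    """Real frequency response H(w) = h0 + 2 * sum_n h_n cos(n w) of a symmetric filter."""
    half = np.asarray(half, dtype=float)
    n = np.arange(1, len(half))
    return half[0] + 2.0 * np.sum(half[1:] * np.cos(n * omega))

for name, half in HALF_FILTERS.items():
    dc = full_kernel(half).sum()
    mid = response(half, np.pi / 2)   # response at the post-decimation Nyquist frequency
    top = response(half, np.pi)       # response at the input Nyquist frequency
    print(f"{name:14s} DC gain {dc:.4f}  H(pi/2) {mid:+.4f}  H(pi) {top:+.4f}")
```

For 2× decimation, frequencies above π/2 alias after subsampling, so the response between π/2 and π is a rough indicator of how much aliasing each filter admits.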


3.5.3 Multi-resolution representations 


Now that we have described interpolation and decimation algorithms, we can build a complete
image pyramid (Figure 3.31). As we mentioned before, pyramids can be used to accelerate
coarse-to-fine search algorithms, to look for objects or patterns at different scales, and to
perform multi-resolution blending operations. They are also widely used in computer graphics
hardware and software to perform fractional-level decimation using the MIP-map, which we
discuss in Section 3.6.

|n|    Linear   Binomial   Cubic      Cubic       Windowed   QMF-9      JPEG
                           a = -1     a = -0.5    sinc                  2000
 0     0.50     0.3750      0.5000     0.50000     0.4939     0.5638     0.6029
 1     0.25     0.2500      0.3125     0.28125     0.2684     0.2932     0.2669
 2              0.0625      0.0000     0.00000     0.0000    -0.0519    -0.0782
 3                         -0.0625    -0.03125    -0.0153    -0.0431    -0.0169
 4                                                 0.0000     0.0198     0.0267

Table 3.2 Filter coefficients for 2× decimation. These filters are of odd length, are
symmetric, and are normalized to have unit DC gain (sum up to 1). See Figure 3.30 for their
associated frequency responses.

Figure 3.30 Frequency response for some 2× decimation filters. The cubic a = −1 filter
has the sharpest fall-off but also a bit of ringing; the wavelet analysis filters (QMF-9 and
JPEG 2000), while useful for compression, have more aliasing.

The best known (and probably most widely used) pyramid in computer vision is Burt and 
Adelson’s (1983a) Laplacian pyramid. To construct the pyramid, we first blur and subsample 
the original image by a factor of two and store this in the next level of the pyramid (Fig- 
ures 3.31 and 3.32). Because adjacent levels in the pyramid are related by a sampling rate 
r = 2, this kind of pyramid is known as an octave pyramid. Burt and Adelson originally 


proposed a five-tap kernel of the form

    \begin{bmatrix} c & b & a & b & c \end{bmatrix},    (3.68)

with b = 1/4 and c = 1/4 − a/2. In practice, they and everyone else use a = 3/8, which
results in the familiar binomial kernel,

    \frac{1}{16} \begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix},    (3.69)


which is particularly easy to implement using shifts and adds. (This was important in the days
when multipliers were expensive.) The reason they call their resulting pyramid a Gaussian
pyramid is that repeated convolutions of the binomial kernel converge to a Gaussian.¹⁴

¹⁴ Then again, this is true for any smoothing kernel (Wells 1986).

Figure 3.31 A traditional image pyramid: each level has half the resolution (width and
height), and hence a quarter of the pixels, of its parent level.

Figure 3.32 The Gaussian pyramid shown as a signal processing diagram: The (a) analysis
and (b) re-synthesis stages are shown as using similar computations. The white circles
indicate zero values inserted by the ↑2 upsampling operation. Notice how the reconstruction
filter coefficients are twice the analysis coefficients. The computation is shown as flowing
down the page, regardless of whether we are going from coarse to fine or vice versa.

Figure 3.33 The Laplacian pyramid. The yellow images form the Gaussian pyramid, which
is obtained by successively low-pass filtering and downsampling the input image. The blue
images, together with the smallest low-pass image, which is needed for reconstruction, form
the Laplacian pyramid. Each band-pass (blue) image is computed by upsampling and
interpolating the lower-resolution Gaussian pyramid image, resulting in a blurred version of that
level’s low-pass image, which is subtracted from the low-pass to yield the blue band-pass
image. During reconstruction, the interpolated images and the (optionally filtered) high-pass
images are added back together starting with the coarsest level. The Q box indicates
quantization or some other pyramid processing, e.g., noise removal by coring (setting small wavelet
values to 0).


To compute the Laplacian pyramid, Burt and Adelson first interpolate a lower resolu- 
tion image to obtain a reconstructed low-pass version of the original image (Figure 3.33). 
They then subtract this low-pass version from the original to yield the band-pass “Laplacian” 
image, which can be stored away for further processing. The resulting pyramid has perfect 
reconstruction, i.e., the Laplacian images plus the base-level Gaussian (L₂ in Figure 3.33)
are sufficient to exactly reconstruct the original image. Figure 3.32 shows the same com- 
putation in one dimension as a signal processing diagram, which completely captures the 


computations being performed during the analysis and re-synthesis stages. 
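The analysis and re-synthesis stages can be sketched in a few lines of NumPy. This is only an illustrative implementation under stated assumptions: the function names, the edge-replicating border mode, and the restriction to single-channel images are choices made here, not part of Burt and Adelson's description. The upsampling step inserts zeros and interpolates with twice the analysis kernel along each axis, as noted in the caption of Figure 3.32.

```python
import numpy as np

KERNEL = np.array([1, 4, 6, 4, 1]) / 16.0   # binomial kernel (3.69)

def blur(image, kernel):
    """Separable filtering with edge replication (the border mode is an assumption)."""
    half = len(kernel) // 2
    out = image.astype(float)
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (half, half)
        padded = np.pad(out, pad, mode="edge")
        n = out.shape[axis]
        # Sum of shifted copies weighted by the (symmetric) kernel taps.
        out = sum(w * np.take(padded, np.arange(k, k + n), axis=axis)
                  for k, w in enumerate(kernel))
    return out

def downsample(image):
    """Blur, then keep every second sample in each dimension."""
    return blur(image, KERNEL)[::2, ::2]

def upsample(image, shape):
    """Insert zeros, then interpolate with twice the analysis kernel along each axis."""
    up = np.zeros(shape)
    up[::2, ::2] = image
    return blur(up, 2.0 * KERNEL)

def laplacian_pyramid(image, levels):
    """Return (band-pass images from fine to coarse, base-level Gaussian image)."""
    gaussian = [image.astype(float)]
    for _ in range(levels):
        gaussian.append(downsample(gaussian[-1]))
    bands = [g - upsample(g_next, g.shape)
             for g, g_next in zip(gaussian[:-1], gaussian[1:])]
    return bands, gaussian[-1]

def reconstruct(bands, base):
    """Add the band-pass images back in, starting from the coarsest level."""
    image = base
    for band in reversed(bands):
        image = upsample(image, band.shape) + band
    return image
```

Because each band stores exactly what the interpolated coarser level fails to predict, `reconstruct(*laplacian_pyramid(image, levels))` reproduces the input up to floating-point rounding, regardless of which blur kernel is used.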


Burt and Adelson also describe a variant of the Laplacian pyramid, where the low-pass
image is taken from the original blurred image rather than the reconstructed pyramid (piping
the output of the L box directly to the subtraction in Figure 3.33). This variant has less
aliasing, since it avoids one downsampling and upsampling round-trip, but it is not
self-inverting, since the Laplacian images are no longer adequate to reproduce the original image.

Figure 3.34 The difference of two low-pass filters results in a band-pass filter. The dashed
blue lines show the close fit to a half-octave Laplacian of Gaussian.

As with the Gaussian pyramid, the term Laplacian is a bit of a misnomer, since their
band-pass images are really differences of (approximate) Gaussians, or DoGs,

    DoG\{I; \sigma_1, \sigma_2\} = G_{\sigma_1} * I - G_{\sigma_2} * I = (G_{\sigma_1} - G_{\sigma_2}) * I.    (3.70)

A Laplacian of Gaussian (which we saw in (3.26)) is actually its second derivative,

    LoG\{I; \sigma\} = \nabla^2 (G_\sigma * I) = (\nabla^2 G_\sigma) * I,    (3.71)

where

    \nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}    (3.72)


is the Laplacian (operator) of a function. Figure 3.34 shows how the Differences of Gaussian 
and Laplacians of Gaussian look in both space and frequency. 
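The similarity between the two kernels is easy to check numerically. The sketch below works in one dimension for brevity and compares a DoG with standard deviations a factor of √2 apart against the (negated) second derivative of a Gaussian. The intermediate scale `sigma_mid`, taken here as the root-mean-square of the two DoG scales, is an assumption chosen for illustration rather than a value given in the text.

```python
import numpy as np

def gaussian(x, sigma):
    """Unit-area 1D Gaussian."""
    return np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def dog(x, sigma1, sigma2):
    """Difference of Gaussians G_sigma1 - G_sigma2, as in (3.70)."""
    return gaussian(x, sigma1) - gaussian(x, sigma2)

def log_1d(x, sigma):
    """Second derivative of a 1D Gaussian (1D analog of the Laplacian of Gaussian)."""
    return (x**2 / sigma**4 - 1.0 / sigma**2) * gaussian(x, sigma)

def normalized_correlation(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.linspace(-20.0, 20.0, 801)
dx = x[1] - x[0]
sigma1, sigma2 = 2.0, 2.0 * np.sqrt(2.0)                 # scales half an octave apart
sigma_mid = np.sqrt(0.5 * (sigma1**2 + sigma2**2))       # assumed matching LoG scale

d = dog(x, sigma1, sigma2)
# With sigma1 < sigma2 the DoG is positive at the origin, whereas the LoG is negative there,
# so we compare against the negated LoG.
similarity = normalized_correlation(d, -log_1d(x, sigma_mid))
dc_gain = d.sum() * dx                                   # close to 0 for a band-pass kernel
print(f"correlation with -LoG: {similarity:.4f}, DC gain: {dc_gain:.2e}")
```

A DC gain near zero confirms that the difference of two unit-gain low-pass filters is band-pass, and a correlation near one reflects the close fit illustrated in Figure 3.34.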

Laplacians of Gaussian have elegant mathematical properties, which have been widely 
studied in the scale-space community (Witkin 1983; Witkin, Terzopoulos, and Kass 1986; 
Lindeberg 1990; Nielsen, Florack, and Deriche 1997) and can be used for a variety of appli- 
cations including edge detection (Marr and Hildreth 1980; Perona and Malik 1990b), stereo 
matching (Witkin, Terzopoulos, and Kass 1987), and image enhancement (Nielsen, Florack, 
and Deriche 1997). 

One particularly useful application of the Laplacian pyramid is in the manipulation of
local contrast as well as the tone mapping of high dynamic range images (Section 10.2.1).
Paris, Hasinoff, and Kautz (2011) present a technique they call local Laplacian filters, which
uses local range clipping in the construction of a modified Laplacian pyramid, as well as
different accentuation and attenuation curves for small and large details, to implement
edge-preserving filtering and tone mapping. Aubry, Paris et al. (2014) discuss how to accelerate this
processing for monotone (single channel) images and also show style transfer applications.

Figure 3.35 Multiresolution pyramids: (a) pyramid with half-octave (quincunx) sampling
(odd levels are colored gray for clarity); (b) wavelet pyramid: each wavelet level stores 3/4
of the original pixels (usually the horizontal, vertical, and mixed gradients), so that the total
number of wavelet coefficients and original pixels is the same.

A less widely used variant is half-octave pyramids, shown in Figure 3.35a. These were
first introduced to the vision community by Crowley and Stern (1984), who call them
Difference of Low-Pass (DOLP) transforms. Because of the small scale change between
adjacent levels, the authors claim that coarse-to-fine algorithms perform better. In the
image-processing community, half-octave pyramids combined with checkerboard sampling grids
are known as quincunx sampling (Feilner, Van De Ville, and Unser 2005). In detecting
multi-scale features (Section 7.1.1), it is common to use half-octave or even quarter-octave
pyramids (Lowe 2004; Triggs 2004). However, in this case, the subsampling only occurs
at every octave level, i.e., the image is repeatedly blurred with wider Gaussians until a full
octave of resolution change has been achieved (Figure 7.11).


3.5.4 Wavelets 


While pyramids are used extensively in computer vision applications, some people use wavelet 
decompositions as an alternative. Wavelets are filters that localize a signal in both space and 
frequency (like the Gabor filter) and are defined over a hierarchy of scales. Wavelets provide 
a smooth way to decompose a signal into frequency components without blocking and are
closely related to pyramids.

Figure 3.36 A wavelet decomposition of an image: (a) single level decomposition with
horizontal, vertical, and diagonal detail wavelets constructed using PyWavelet code
(https://pywavelets.readthedocs.io); (b) coefficient magnitudes of a multi-level decomposition, with
the high–high components in the lower right corner and the base in the upper left (Buccigrossi
and Simoncelli 1999) © 1999 IEEE. Notice how the low–high and high–low components
accentuate horizontal and vertical edges and gradients, while the high–high components store
the less frequent mixed derivatives.


Wavelets were originally developed in the applied math and signal processing communi- 
ties and were introduced to the computer vision community by Mallat (1989). Strang (1989), 
Simoncelli and Adelson (1990b), Rioul and Vetterli (1991), Chui (1992), and Meyer (1993) 
all provide nice introductions to the subject along with historical reviews, while Chui (1992) 
provides a more comprehensive review and survey of applications. Sweldens (1997) describes 
the lifting approach to wavelets that we discuss shortly. 


Wavelets are widely used in the computer graphics community to perform multi-resolution 
geometric processing (Stollnitz, DeRose, and Salesin 1996) and have also been used in com- 
puter vision for similar applications (Szeliski 1990b; Pentland 1994; Gortler and Cohen 1995; 
Yaou and Chang 1994; Lai and Vemuri 1997; Szeliski 2006b; Krishnan and Szeliski 2011; 
Krishnan, Fattal, and Szeliski 2013), as well as for multi-scale oriented filtering (Simoncelli, 
Freeman et al. 1992) and denoising (Portilla, Strela et al. 2003). 


As both image pyramids and wavelets decompose an image into multi-resolution descrip- 
tions that are localized in both space and frequency, how do they differ? The usual answer is 
that traditional pyramids are overcomplete, i.e., they use more pixels than the original image 
to represent the decomposition, whereas wavelets provide a tight frame, i.e., they keep the 
size of the decomposition the same as the image (Figure 3.35b). However, some wavelet 
families are, in fact, overcomplete in order to provide better shiftability or steering in orienta- 
tion (Simoncelli, Freeman et al. 1992). A better distinction, therefore, might be that wavelets 
are more orientation selective than regular band-pass pyramids.

Figure 3.37 Two-dimensional wavelet decomposition: (a) high-level diagram showing the
low-pass and high-pass transforms as single boxes; (b) separable implementation, which
involves first performing the wavelet transform horizontally and then vertically. The I and
F boxes are the interpolation and filtering boxes required to re-synthesize the image from its
wavelet components.


How are two-dimensional wavelets constructed? Figure 3.37a shows a high-level dia- 
gram of one stage of the (recursive) coarse-to-fine construction (analysis) pipeline alongside 
the complementary re-construction (synthesis) stage. In this diagram, the high-pass filter 
followed by decimation keeps 3/4 of the original pixels, while 1/4 of the low-frequency coef- 
ficients are passed on to the next stage for further analysis. In practice, the filtering is usually 
broken down into two separable sub-stages, as shown in Figure 3.37b. The resulting three 
wavelet images are sometimes called the high–high (HH), high–low (HL), and low–high
(LH) images. The high–low and low–high images accentuate the horizontal and vertical
edges and gradients, while the high–high image contains the less frequently occurring mixed
derivatives (Figure 3.36). 
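The separable construction of Figure 3.37b can be illustrated with the shortest possible filter pair, the Haar wavelet. Using Haar filters is an assumption made purely to keep the sketch short (practical systems use longer filters), and the sub-band labels below follow a (horizontal filter, vertical filter) naming order, which is also a convention chosen here. Image dimensions are assumed to be even.

```python
import numpy as np

def haar_1d(x, axis):
    """One level of an orthonormal Haar split along `axis`: returns (low, high)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_2d(image):
    """Separable analysis: filter horizontally (axis 1), then vertically (axis 0)."""
    low, high = haar_1d(image.astype(float), axis=1)
    ll, lh = haar_1d(low, axis=0)     # horizontally low-pass, then low/high vertically
    hl, hh = haar_1d(high, axis=0)    # horizontally high-pass, then low/high vertically
    return ll, lh, hl, hh

def inverse_haar_1d(low, high, axis):
    """Invert haar_1d by recovering and re-interleaving the even and odd samples."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    shape = list(low.shape)
    shape[axis] *= 2
    out = np.empty(shape)
    index_even = [slice(None)] * out.ndim
    index_odd = [slice(None)] * out.ndim
    index_even[axis] = slice(0, None, 2)
    index_odd[axis] = slice(1, None, 2)
    out[tuple(index_even)] = even
    out[tuple(index_odd)] = odd
    return out

def inverse_haar_2d(ll, lh, hl, hh):
    """Separable synthesis: undo the vertical split, then the horizontal one."""
    low = inverse_haar_1d(ll, lh, axis=0)
    high = inverse_haar_1d(hl, hh, axis=0)
    return inverse_haar_1d(low, high, axis=1)
```

Each of the four outputs has a quarter of the input pixels, so the three detail images hold 3/4 of the coefficients and the low-pass image `ll` holds the remaining 1/4, which would be passed on to the next stage of a recursive decomposition.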

How are the high-pass H and low-pass L filters shown in Figure 3.37b chosen and how
can the corresponding reconstruction filters I and F be computed? Can filters be designed
that all have finite impulse responses? This topic has been the main subject of study in the
wavelet community for over two decades. The answer depends largely on the intended
application, e.g., whether the wavelets are being used for compression, image analysis (feature
finding), or denoising. Simoncelli and Adelson (1990b) show (in Table 4.1) some good
odd-length quadrature mirror filter (QMF) coefficients that seem to work well in practice.

Figure 3.38 One-dimensional wavelet transform: (a) usual high-pass + low-pass filters
followed by odd (↓2ₒ) and even (↓2ₑ) downsampling; (b) lifted version, which first selects
the odd and even subsequences and then applies a low-pass prediction stage L and a
high-pass correction stage C in an easily reversible manner.

Since the design of wavelet filters is such a tricky art, is there perhaps a better way? In- 
deed, a simpler procedure is to split the signal into its even and odd components and then 
perform trivially reversible filtering operations on each sequence to produce what are called 
lifted wavelets (Figures 3.38 and 3.39). Sweldens (1996) gives a wonderfully understandable 
introduction to the lifting scheme for second-generation wavelets, followed by a comprehen- 
sive review (Sweldens 1997). 


As Figure 3.38 demonstrates, rather than first filtering the whole input sequence (image)
with high-pass and low-pass filters and then keeping the odd and even sub-sequences, the
lifting scheme first splits the sequence into its even and odd sub-components. Filtering the
even sequence with a low-pass filter L and subtracting the result from the odd sequence
is trivially reversible: simply perform the same filtering and then add the result back in.
Furthermore, this operation can be performed in place, resulting in significant space savings.
The same applies to filtering the difference signal with the correction filter C, which is used to
ensure that the even sequence is low-pass. A series of such lifting steps can be used to create
more complex filter responses with low computational cost and guaranteed reversibility.

Figure 3.39 Lifted transform shown as a signal processing diagram: (a) The analysis
stage first predicts the odd value from its even neighbors, stores the difference wavelet, and
then compensates the coarser even value by adding in a fraction of the wavelet. (b) The
synthesis stage simply reverses the flow of computation and the signs of some of the filters
and operations. The light blue lines show what happens if we use four taps for the prediction
and correction instead of just two.


This process can be more easily understood by considering the signal processing diagram 
in Figure 3.39. During analysis, the average of the even values is subtracted from the odd 
value to obtain a high-pass wavelet coefficient. However, the even samples still contain an 
aliased sample of the low-frequency signal. To compensate for this, a small amount of the 
high-pass wavelet is added back to the even sequence so that it is properly low-pass filtered. 
(It is easy to show that the effective low-pass filter is [−1/8, 1/4, 3/4, 1/4, −1/8], which is
indeed a low-pass filter.) During synthesis, the same operations are reversed with a judicious
change in sign. 
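The two-tap lifting steps just described can be written down directly. In the sketch below, the prediction weight of 1/2 per even neighbor and the correction weight of 1/4 per adjacent wavelet follow the description above, while the periodic border handling (via `np.roll`), the even-length input, and the function names are assumptions made for illustration.

```python
import numpy as np

def lift_analysis(signal):
    """Two-tap lifting step on an even-length signal with periodic borders (an assumption)."""
    x = np.asarray(signal, dtype=float)
    even, odd = x[0::2].copy(), x[1::2].copy()
    # Predict each odd sample from its two even neighbors; the residual is the wavelet.
    high = odd - 0.5 * (even + np.roll(even, -1))
    # Correct each even sample with a quarter of its two adjacent wavelet coefficients.
    low = even + 0.25 * (high + np.roll(high, 1))
    return low, high

def lift_synthesis(low, high):
    """Undo the lifting steps in reverse order with the signs flipped."""
    even = low - 0.25 * (high + np.roll(high, 1))
    odd = high + 0.5 * (even + np.roll(even, -1))
    x = np.empty(2 * len(low))
    x[0::2], x[1::2] = even, odd
    return x

def effective_lowpass(n=16):
    """Recover the five taps of the equivalent low-pass filter by probing with impulses."""
    center = n // 2                      # an even-indexed input sample
    taps = []
    for offset in range(-2, 3):
        impulse = np.zeros(n)
        impulse[center + offset] = 1.0
        low, _ = lift_analysis(impulse)
        taps.append(low[center // 2])    # weight of that input in the central low-pass output
    return np.array(taps)
```

Probing the analysis stage with unit impulses, as `effective_lowpass` does, is a convenient way to confirm the equivalent filter taps, and `lift_synthesis(*lift_analysis(x))` returns `x` because each lifting step is undone by the same filtering with the opposite sign.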

Of course, we need not restrict ourselves to two-tap filters. Figure 3.39 shows as light
blue arrows additional filter coefficients that could optionally be added to the lifting scheme
without affecting its reversibility. In fact, the low-pass and high-pass filtering operations can
be interchanged, e.g., we could use a five-tap cubic low-pass filter on the odd sequence (plus
center value) first, followed by a four-tap cubic low-pass predictor to estimate the wavelet,
although I have not seen this scheme written down.

164 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021)

Figure 3.40 Steerable shiftable multiscale transforms (Simoncelli, Freeman et al. 1992) ©
1992 IEEE: (a) radial multi-scale frequency domain decomposition; (b) original image; (c)
a set of four steerable filters; (d) the radial multi-scale wavelet decomposition.

Lifted wavelets are called second-generation wavelets because they can easily adapt to 
non-regular sampling topologies, e.g., those that arise in computer graphics applications such 
as multi-resolution surface manipulation (Schröder and Sweldens 1995). It also turns out that
lifted weighted wavelets, i.e., wavelets whose coefficients adapt to the underlying problem 
being solved (Fattal 2009), can be extremely effective for low-level image manipulation tasks 
and also for preconditioning the kinds of sparse linear systems that arise in the optimization- 
based approaches to vision algorithms that we discuss in Chapter 4 (Szeliski 2006b; Krishnan 
and Szeliski 2011; Krishnan, Fattal, and Szeliski 2013). 

An alternative to the widely used “separable” approach to wavelet construction, which
decomposes each level into horizontal, vertical, and “cross” sub-bands, is to use a
representation that is more rotationally symmetric and orientationally selective and also avoids the
aliasing inherent in sampling signals below their Nyquist frequency.¹⁵ Simoncelli, Freeman
et al. (1992) introduce such a representation, which they call a pyramidal radial frequency
implementation of shiftable multi-scale transforms or, more succinctly, steerable pyramids.
Their representation is not only overcomplete (which eliminates the aliasing problem) but is
also orientationally selective and has identical analysis and synthesis basis functions, i.e., it is
self-inverting, just like “regular” wavelets. As a result, this makes steerable pyramids a much
more useful basis for the structural analysis and matching tasks commonly used in computer
vision.

¹⁵ Such aliasing can often be seen as the signal content moving between bands as the original signal is slowly
shifted.

Figure 3.40a shows how such a decomposition looks in frequency space. Instead of
recursively dividing the frequency domain into 2 × 2 squares, which results in checkerboard
high frequencies, radial arcs are used. Figure 3.40d illustrates the resulting pyramid
sub-bands. Even though the representation is overcomplete, i.e., there are more wavelet
coefficients than input pixels, the additional frequency and orientation selectivity makes this
representation preferable for tasks such as texture analysis and synthesis (Portilla and Simon- 
celli 2000) and image denoising (Portilla, Strela et al. 2003; Lyu and Simoncelli 2009). 


3.5.5 Application: Image blending 


One of the most engaging and fun applications of the Laplacian pyramid presented in Sec- 
tion 3.5.3 is the creation of blended composite images, as shown in Figure 3.41 (Burt and 
Adelson 1983b). While splicing the apple and orange images together along the midline 
produces a noticeable cut, splining them together (as Burt and Adelson (1983b) called their 
procedure) creates a beautiful illusion of a truly hybrid fruit. The key to their approach is 
that the low-frequency color variations between the red apple and the orange are smoothly 
blended, while the higher-frequency textures on each fruit are blended more quickly to avoid 
“ghosting” effects when two textures are overlaid. 

To create the blended image, each source image is first decomposed into its own Lapla- 
cian pyramid (Figure 3.42, left and middle columns). Each band is then multiplied by a 
smooth weighting function whose extent is proportional to the pyramid level. The simplest 
and most general way to create these weights is to take a binary mask image (Figure 3.41g) 
and to construct a Gaussian pyramid from this mask. Each Laplacian pyramid image is then 
multiplied by its corresponding Gaussian mask and the sum of these two weighted pyramids 
is then used to construct the final image (Figure 3.42, right column). 
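The blending procedure described above can be sketched as follows for single-channel images. This is a minimal illustration rather than Burt and Adelson's original implementation: the helper names, the edge-replicating borders, the default of four pyramid levels, and the convention that the mask is 1 where the first image should dominate are all assumptions.

```python
import numpy as np

KERNEL = np.array([1, 4, 6, 4, 1]) / 16.0   # five-tap binomial kernel

def blur(image, kernel):
    """Separable filtering with edge replication (the border mode is an assumption)."""
    half = len(kernel) // 2
    out = image.astype(float)
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (half, half)
        padded = np.pad(out, pad, mode="edge")
        n = out.shape[axis]
        out = sum(w * np.take(padded, np.arange(k, k + n), axis=axis)
                  for k, w in enumerate(kernel))
    return out

def gaussian_pyramid(image, levels):
    """Repeatedly blur and subsample by two; returns levels + 1 images."""
    pyramid = [image.astype(float)]
    for _ in range(levels):
        pyramid.append(blur(pyramid[-1], KERNEL)[::2, ::2])
    return pyramid

def expand(image, shape):
    """Zero-insert and interpolate back up to `shape`."""
    up = np.zeros(shape)
    up[::2, ::2] = image
    return blur(up, 2.0 * KERNEL)

def laplacian_pyramid(image, levels):
    """Band-pass levels followed by the coarsest low-pass image."""
    g = gaussian_pyramid(image, levels)
    return [g[i] - expand(g[i + 1], g[i].shape) for i in range(levels)] + [g[-1]]

def pyramid_blend(image_a, image_b, mask, levels=4):
    """Blend two grayscale images; `mask` is 1 where image_a should dominate."""
    lap_a = laplacian_pyramid(image_a, levels)
    lap_b = laplacian_pyramid(image_b, levels)
    weights = gaussian_pyramid(mask, levels)          # smoother masks at coarser levels
    blended = [w * a + (1.0 - w) * b for w, a, b in zip(weights, lap_a, lap_b)]
    result = blended[-1]
    for band in reversed(blended[:-1]):               # collapse from coarse to fine
        result = expand(result, band.shape) + band
    return result
```

Because the Gaussian pyramid of the mask becomes progressively smoother at coarser levels, low frequencies are mixed over a wide transition zone while fine textures switch over within a few pixels of the mask boundary.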

Figure 3.41e–h shows that this process can be applied to arbitrary mask images with
surprising results. It is also straightforward to extend the pyramid blend to an arbitrary num- 
ber of images whose pixel provenance is indicated by an integer-valued label image (see 
Exercise 3.18). This is particularly useful in image stitching and compositing applications, 
where the exposures may vary between different images, as described in Section 8.4.4, where 
we also present more recent variants such as Poisson and gradient-domain blending (Pérez, 
Gangnet, and Blake 2003; Levin, Zomet et al. 2004).

Figure 3.41 Laplacian pyramid blending (Burt and Adelson 1983b) © 1983 ACM: (a)
original image of apple, (b) original image of orange, (c) regular splice, (d) pyramid blend.
A masked blend of two images: (e) first input image, (f) second input image, (g) region mask,
(h) blended image.

Figure 3.42 Laplacian pyramid blending details (Burt and Adelson 1983b) © 1983 ACM.
The first three rows show the high, medium, and low-frequency parts of the Laplacian pyramid
(taken from levels 0, 2, and 4). The left and middle columns show the original apple and
orange images weighted by the smooth interpolation functions, while the right column shows
the averaged contributions.

Figure 3.43 Image warping involves modifying the domain of an image function rather
than its range.


3.6 Geometric transformations 


In the previous sections, we saw how interpolation and decimation could be used to change
the resolution of an image. In this section, we look at how to perform more general
transformations, such as image rotations or general warps. In contrast to the point processes we saw
in Section 3.1, where the function applied to an image transforms the range of the image,

    g(x) = h(f(x)),    (3.73)

here we look at functions that transform the domain,

    g(x) = f(h(x)),    (3.74)

as shown in Figure 3.43.

We begin by studying the global parametric 2D transformation first introduced in Sec- 
tion 2.1.1. (Such a transformation is called parametric because it is controlled by a small 
number of parameters.) We then turn our attention to more local general deformations such 
as those defined on meshes (Section 3.6.2). Finally, we show in Section 3.6.3 how image 
warps can be combined with cross-dissolves to create interesting morphs (in-between ani- 
mations). For readers interested in more details on these topics, there is an excellent survey 
by Heckbert (1986) as well as very accessible textbooks by Wolberg (1990), Gomes, Darsa 
et al. (1999) and Akenine-Möller and Haines (2002). Note that Heckbert’s survey is on
texture mapping, which is how the computer graphics community refers to the topic of warping
images onto surfaces.


3.6.1 Parametric transformations 


Parametric transformations apply a global deformation to an image, where the behavior of the
transformation is controlled by a small number of parameters. Figure 3.44 shows a few
examples of such transformations, which are based on the 2D geometric transformations shown
in Figure 2.4. The formulas for these transformations were originally given in Table 2.1 and
are reproduced here in Table 3.3 for ease of reference.

Figure 3.44 Basic set of 2D geometric image transformations.

Transformation       Matrix             # DoF   Preserves
translation          [ I  t ]  (2×3)    2       orientation
rigid (Euclidean)    [ R  t ]  (2×3)    3       lengths
similarity           [ sR t ]  (2×3)    4       angles
affine               [ A ]     (2×3)    6       parallelism
projective           [ H̃ ]     (3×3)    8       straight lines

Table 3.3 Hierarchy of 2D coordinate transformations. Each transformation also
preserves the properties listed in the rows below it, i.e., similarity preserves not only angles but
also parallelism and straight lines. The 2 × 3 matrices are extended with a third [0ᵀ 1] row
to form a full 3 × 3 matrix for homogeneous coordinate transformations.

In general, given a transformation specified by a formula x’ = h(x) and a source image 
f(x), how do we compute the values of the pixels in the new image g(x), as given in (3.74)? 
Think about this for a minute before proceeding and see if you can figure it out. 

If you are like most people, you will come up with an algorithm that looks something like 
Algorithm 3.1. This process is called forward warping or forward mapping and is shown in 
Figure 3.45a. Can you think of any problems with this approach? 

In fact, this approach suffers from several limitations. The process of copying a pixel
f(x) to a location x’ in g is not well defined when x’ has a non-integer value. What do we
do in such a case? What would you do?

Figure 3.45 Forward warping algorithm: (a) a pixel f(x) is copied to its corresponding
location x’ = h(x) in image g(x’); (b) detail of the source and destination pixel locations.

procedure forwardWarp(f, h, out g):
    For every pixel x in f(x)
        1. Compute the destination location x’ = h(x).
        2. Copy the pixel f(x) to g(x’).

Algorithm 3.1 Forward warping algorithm for transforming an image f(x) into an image
g(x’) through the parametric transform x’ = h(x).

You can round the value of x’ to the nearest integer coordinate and copy the pixel there, 
but the resulting image has severe aliasing and pixels that jump around a lot when animating 
the transformation. You can also “distribute” the value among its four nearest neighbors in 
a weighted (bilinear) fashion, keeping track of the per-pixel weights and normalizing at the 
end. This technique is called splatting and is sometimes used for volume rendering in the 
graphics community (Levoy and Whitted 1985; Levoy 1988; Westover 1989; Rusinkiewicz 
and Levoy 2000). Unfortunately, it suffers from both moderate amounts of aliasing and a 
fair amount of blur (loss of high-resolution detail). 

The second major problem with forward warping is the appearance of cracks and holes, 
especially when magnifying an image. Filling such holes with their nearby neighbors can 
lead to further aliasing and blurring. 
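The hole problem is easy to reproduce. The following sketch implements forward warping with nearest-integer rounding and records which destination pixels were ever written; the function name, the `(x, y)` callable interface for the transform, and the 2× magnification used as a test case are assumptions chosen for illustration.

```python
import numpy as np

def forward_warp_nearest(f, h):
    """Copy each source pixel to the rounded location x' = h(x); returns (g, filled)."""
    height, width = f.shape
    g = np.zeros_like(f, dtype=float)
    filled = np.zeros(f.shape, dtype=bool)        # which destination pixels received a value
    for y in range(height):
        for x in range(width):
            xp, yp = h(x, y)
            xi, yi = int(round(xp)), int(round(yp))
            if 0 <= xi < width and 0 <= yi < height:
                g[yi, xi] = f[y, x]
                filled[yi, xi] = True
    return g, filled

def magnify(x, y, scale=2.0):
    """Example transform: uniform magnification about the origin."""
    return scale * x, scale * y

source = np.arange(64, dtype=float).reshape(8, 8) + 1.0
warped, filled = forward_warp_nearest(source, magnify)
print(f"{np.count_nonzero(~filled)} of {filled.size} destination pixels are holes")
```

Under magnification, neighboring source pixels land more than one pixel apart in the destination, so most destination pixels are never visited and must be filled in by some other means.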

What can we do instead? A preferable solution is to use inverse warping (Algorithm 3.2), 
where each pixel in the destination image g(x’) is sampled from the original image f(x) 
(Figure 3.46). 

How does this differ from the forward warping algorithm? For one thing, since ĥ(x’)
is (presumably) defined for all pixels in g(x’), we no longer have holes. More importantly,
resampling an image at non-integer locations is a well-studied problem (general image
interpolation, see Section 3.5.2) and high-quality filters that control aliasing can be used.

Figure 3.46 Inverse warping algorithm: (a) a pixel g(x’) is sampled from its corresponding
location x = ĥ(x’) in image f(x); (b) detail of the source and destination pixel locations.

procedure inverseWarp(f, h, out g):
    For every pixel x’ in g(x’)
        1. Compute the source location x = ĥ(x’).
        2. Resample f(x) at location x and copy to g(x’).

Algorithm 3.2 Inverse warping algorithm for creating an image g(x’) from an image f(x)
using the parametric transform x’ = h(x).

Where does the function ĥ(x’) come from? Quite often, it can simply be computed as the
inverse of h(x). In fact, all of the parametric transforms listed in Table 3.3 have closed form
solutions for the inverse transform: simply take the inverse of the 3 × 3 matrix specifying the
transform.

In other cases, it is preferable to formulate the problem of image warping as that of
resampling a source image f(x) given a mapping x = ĥ(x’) from destination pixels x’ to
source pixels x. For example, in optical flow (Section 9.3), we estimate the flow field as the
location of the source pixel that produced the current pixel whose flow is being estimated, as
opposed to computing the destination pixel to which it is going. Similarly, when correcting
for radial distortion (Section 2.1.5), we calibrate the lens by computing for each pixel in the
final (undistorted) image the corresponding pixel location in the original (distorted) image.
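For transforms given by a 3 × 3 matrix, Algorithm 3.2 can be sketched as below. This is a simple, unoptimized illustration: the function names, the use of bilinear interpolation as the resampling filter, and the choice to return 0 for samples that fall outside the source image are assumptions, and no prefiltering is performed, so the sketch is only appropriate when the warp does not shrink the image.

```python
import numpy as np

def bilinear_sample(f, x, y):
    """Sample f at real-valued (x, y); returns 0 outside the image (an assumption)."""
    height, width = f.shape
    if not (0 <= x <= width - 1 and 0 <= y <= height - 1):
        return 0.0
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0                      # fractional offsets within the pixel cell
    top = (1 - ax) * f[y0, x0] + ax * f[y0, x1]
    bottom = (1 - ax) * f[y1, x0] + ax * f[y1, x1]
    return (1 - ay) * top + ay * bottom

def inverse_warp(f, H, out_shape):
    """Create g by sampling f at the inverse-mapped location of every destination pixel."""
    H_inv = np.linalg.inv(H)                     # closed-form inverse of the 3x3 transform
    g = np.zeros(out_shape)
    for yp in range(out_shape[0]):
        for xp in range(out_shape[1]):
            x, y, w = H_inv @ np.array([xp, yp, 1.0])
            g[yp, xp] = bilinear_sample(f, x / w, y / w)   # divide out the homogeneous scale
    return g
```

Every destination pixel is visited exactly once, so a 2× magnification such as `inverse_warp(f, np.diag([2.0, 2.0, 1.0]), out_shape)` produces no holes, in contrast to forward warping.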

What kinds of interpolation filter are suitable for the resampling process? Any of the
filters we studied in Section 3.5.2 can be used, including nearest neighbor, bilinear, bicubic, and
windowed sinc functions. While bilinear is often used for speed (e.g., inside the inner loop
of a patch-tracking algorithm, see Section 9.1.3), bicubic and windowed sinc are preferable
where visual quality is important.

To compute the value of f(x) at a non-integer location x, we simply apply our usual FIR
resampling filter,

    g(x, y) = \sum_{k,l} f(k, l) \, h(x - k, y - l),    (3.75)


where (x, y) are the sub-pixel coordinate values and h(x, y) is some interpolating or smooth- 
ing kernel. Recall from Section 3.5.2 that when decimation is being performed, the smoothing 
kernel is stretched and re-scaled according to the downsampling rate r. 

Unfortunately, for a general (non-zoom) image transformation, the resampling rate r is 
not well defined. Consider a transformation that stretches the x dimensions while squashing 
the y dimensions. The resampling kernel should be performing regular interpolation along 
the x dimension and smoothing (to anti-alias the blurred image) in the y direction. This gets 
even more complicated for the case of general affine or perspective transforms. 

What can we do? Fortunately, Fourier analysis can help. The two-dimensional
generalization of the one-dimensional domain scaling law is

    g(Ax) \Leftrightarrow |A|^{-1} G(A^{-T} f).    (3.76)


For all of the transforms in Table 3.3 except perspective, the matrix A is already defined. 
For perspective transformations, the matrix A is the linearized derivative of the perspective 
transformation (Figure 3.47a), i.e., the local affine approximation to the stretching induced 
by the projection (Heckbert 1986; Wolberg 1990; Gomes, Darsa et al. 1999; Akenine-Möller
and Haines 2002). 
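
To make the "linearized derivative" concrete, the following sketch computes the 2 × 2 Jacobian of a perspective transform at a given point. It assumes the transform is given as a 3 × 3 homography `H` stored as nested lists; the function name `homography_jacobian` is hypothetical. Whether `H` maps source to destination or the reverse depends on the warping convention in use; the code simply differentiates whichever map it is given.

```python
def homography_jacobian(H, x, y):
    """Return the 2x2 Jacobian (local affine approximation) of the map
    x' = (h00 x + h01 y + h02) / d,  y' = (h10 x + h11 y + h12) / d,
    with d = h20 x + h21 y + h22, evaluated at the point (x, y)."""
    d = H[2][0] * x + H[2][1] * y + H[2][2]
    xp = (H[0][0] * x + H[0][1] * y + H[0][2]) / d
    yp = (H[1][0] * x + H[1][1] * y + H[1][2]) / d
    # Quotient rule: d(x')/dx = (h00 - x' h20) / d, and similarly for the rest.
    return [
        [(H[0][0] - xp * H[2][0]) / d, (H[0][1] - xp * H[2][1]) / d],
        [(H[1][0] - yp * H[2][0]) / d, (H[1][1] - yp * H[2][1]) / d],
    ]
```

For an affine `H` (last row 0, 0, 1) the result is the same matrix everywhere, which matches the statement that A is already defined for the non-perspective transforms.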

To prevent aliasing, we need to prefilter the image f(x) with a filter whose frequency
response is the projection of the final desired spectrum through the $\mathbf{A}^{-T}$ transform (Szeliski,
Winder, and Uyttendaele 2010). In general (for non-zoom transforms), this filter is
non-separable and hence is very slow to compute. Therefore, a number of approximations to this
filter are used in practice, including MIP-mapping, elliptically weighted Gaussian averaging,
and anisotropic filtering (Akenine-Möller and Haines 2002).


MIP-mapping 


MIP-mapping was first proposed by Williams (1983) as a means to rapidly prefilter images
being used for texture mapping in computer graphics. A MIP-map¹⁶ is a standard image pyramid
(Figure 3.31), where each level is prefiltered with a high-quality filter rather than a poorer
quality approximation, such as Burt and Adelson’s (1983b) five-tap binomial. To resample
an image from a MIP-map, a scalar estimate of the resampling rate r is first computed. For
example, r can be the maximum of the absolute values in A (which suppresses aliasing) or
it can be the minimum (which reduces blurring). Akenine-Möller and Haines (2002) discuss
these issues in more detail.

¹⁶ The term “MIP” stands for multi in parvo, meaning “many in one”.

Figure 3.47 Anisotropic texture filtering: (a) Jacobian of transform A and the induced
horizontal and vertical resampling rates $\{a_{x'x}, a_{x'y}, a_{y'x}, a_{y'y}\}$; (b) elliptical footprint of an
EWA smoothing kernel; (c) anisotropic filtering using multiple samples along the major axis.
Image pixels lie at line intersections.

Once a resampling rate has been specified, a fractional pyramid level is computed using
the base 2 logarithm,

$$l = \log_2 r. \tag{3.77}$$


One simple solution is to resample the texture from the next higher or lower pyramid level, 
depending on whether it is preferable to reduce aliasing or blur. A better solution is to re- 
sample both images and blend them linearly using the fractional component of l. Since most 
MIP-map implementations use bilinear resampling within each level, this approach is usu- 
ally called trilinear MIP-mapping. Computer graphics rendering APIs, such as OpenGL and 
Direct3D, have parameters that can be used to select which variant of MIP-mapping (and of 
the sampling rate r computation) should be used, depending on the desired tradeoff between 
speed and quality. Exercise 3.22 has you examine some of these tradeoffs in more detail. 
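
The level selection and blending described above can be sketched as follows. This is only an illustration under stated assumptions: A is taken to map destination coordinates to source coordinates (so large entries mean minification), magnification (r < 1) is clamped to level 0, and `sample_level` is a hypothetical callback standing in for a bilinear lookup within one pyramid level. It is not how any particular graphics API implements the feature.

```python
import math

def resampling_rate(A, conservative=True):
    """Scalar rate r from the 2x2 matrix A: the largest |entry| (suppresses
    aliasing) or the smallest |entry| (reduces blurring)."""
    entries = [abs(v) for row in A for v in row]
    return max(entries) if conservative else min(entries)

def mip_levels(r, num_levels):
    """Return (lower level, upper level, blend weight) for rate r."""
    # Assumption: rates below 1 (magnification) read from the finest level.
    l = min(math.log2(max(r, 1.0)), num_levels - 1)
    lo = int(math.floor(l))
    hi = min(lo + 1, num_levels - 1)
    return lo, hi, l - lo          # l - lo is the fractional component of l

def trilinear_sample(sample_level, A, x, y, num_levels):
    """Blend samples from the two levels that bracket l = log2(r).

    sample_level(level, x, y) is a hypothetical callback that bilinearly
    samples pyramid `level` at base-resolution coordinates (x, y)."""
    lo, hi, t = mip_levels(resampling_rate(A), num_levels)
    return (1.0 - t) * sample_level(lo, x, y) + t * sample_level(hi, x, y)
```

Picking only `lo` or only `hi` instead of blending corresponds to the simpler "next lower or higher level" variant mentioned in the text.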


Elliptical Weighted Average 


The Elliptical Weighted Average (EWA) filter invented by Greene and Heckbert (1986) is
based on the observation that the affine mapping x = Ax’ defines a skewed two-dimensional
coordinate system in the vicinity of each source pixel x (Figure 3.47a). For every destination
pixel x’, the ellipsoidal projection of a small pixel grid in x’ onto x is computed
(Figure 3.47b). This is then used to filter the source image g(x) with a Gaussian whose inverse
covariance matrix is this ellipsoid.

Figure 3.48 One-dimensional signal resampling (Szeliski, Winder, and Uyttendaele 2010):
(a) original sampled signal f(i); (b) interpolated signal g_1(x); (c) warped signal g_2(x); (d)
filtered signal g_3(x); (e) sampled signal f'(i). The corresponding spectra are shown below
the signals, with the aliased portions shown in red.

Despite its reputation as a high-quality filter (Akenine-Möller and Haines 2002), we have
found in our work (Szeliski, Winder, and Uyttendaele 2010) that because a Gaussian kernel 
is used, the technique suffers simultaneously from both blurring and aliasing, compared to 
higher-quality filters. The EWA is also quite slow, although faster variants based on MIP- 
mapping have been proposed, as described in (Szeliski, Winder, and Uyttendaele 2010). 


Anisotropic filtering 


An alternative approach to filtering oriented textures, which is sometimes implemented in 
graphics hardware (GPUs), is to use anisotropic filtering (Barkans 1997; Akenine-Möller and
Haines 2002). In this approach, several samples at different resolutions (fractional levels in 
the MIP-map) are combined along the major axis of the EWA Gaussian (Figure 3.47c). 


Multi-pass transforms 


The optimal approach to warping images without excessive blurring or aliasing is to adaptively
prefilter the source image at each pixel using an ideal low-pass filter, i.e., an oriented
skewed sinc or low-order (e.g., cubic) approximation (Figure 3.47a). Figure 3.48 shows how
this works in one dimension. The signal is first (theoretically) interpolated to a continuous
waveform, (ideally) low-pass filtered to below the new Nyquist rate, and then re-sampled to
the final desired resolution. In practice, the interpolation and decimation steps are concatenated
into a single polyphase digital filtering operation (Szeliski, Winder, and Uyttendaele
2010).

Figure 3.49 Image warping alternatives (Gomes, Darsa et al. 1999) © 1999 Morgan Kaufmann:
(a) sparse control points → deformation grid; (b) denser set of control point correspondences;
(c) oriented line correspondences; (d) uniform quadrilateral grid.

For parametric transforms, the oriented two-dimensional filtering and resampling opera- 
tions can be approximated using a series of one-dimensional resampling and shearing trans- 
forms (Catmull and Smith 1980; Heckbert 1989; Wolberg 1990; Gomes, Darsa et al. 1999; 
Szeliski, Winder, and Uyttendaele 2010). The advantage of using a series of one-dimensional 
transforms is that they are much more efficient (in terms of basic arithmetic operations) than 
large, non-separable, two-dimensional filter kernels. In order to prevent aliasing, however, it 
may be necessary to upsample in the opposite direction before applying a shearing transfor- 
mation (Szeliski, Winder, and Uyttendaele 2010). 
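
One way to see why a series of one-dimensional passes can stand in for a two-dimensional transform is to factor an affine matrix into a horizontal pass followed by a vertical pass, in the spirit of the two-pass scheme of Catmull and Smith (1980). The sketch below only verifies the algebra; it does no filtering, and the function names are illustrative assumptions. The factorization breaks down when the top-left entry of A is close to zero, which is one symptom of the bottleneck that the extra upsampling mentioned above is meant to address.

```python
import math

def two_pass_factors(A):
    """Factor A = [[a, b], [c, d]] as vertical @ horizontal, where
    horizontal changes only x (a 1D resample along each row) and
    vertical changes only y (a 1D resample along each column)."""
    (a, b), (c, d) = A
    if abs(a) < 1e-12:
        raise ValueError("degenerate first pass: a is (nearly) zero")
    horizontal = [[a, b], [0.0, 1.0]]                  # x' = a x + b y, y unchanged
    vertical = [[1.0, 0.0], [c / a, d - b * c / a]]    # y' from (x', y), x' unchanged
    return vertical, horizontal

def matmul2(M, N):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(M[i][k] * N[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def rotation(theta):
    """2x2 rotation matrix, used here only as a test transform."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
```

For example, `two_pass_factors(rotation(math.radians(30)))` returns two matrices whose product reproduces the rotation up to rounding error.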


3.6.2 Mesh-based warping 


While parametric transforms specified by a small number of global parameters have many 
uses, local deformations with more degrees of freedom are often required. 

Consider, for example, changing the appearance of a face from a frown to a smile
(Figure 3.49a). What is needed in this case is to curve the corners of the mouth upwards while
leaving the rest of the face intact.¹⁷ To perform such a transformation, different amounts of
motion are required in different parts of the image. Figure 3.49 shows some of the commonly
used approaches.

The first approach, shown in Figure 3.49a-b, is to specify a sparse set of corresponding 
points. The displacement of these points can then be interpolated to a dense displacement field 
(Chapter 9) using a variety of techniques, which are described in more detail in Section 4.1 
on scattered data interpolation. One possibility is to triangulate the set of points in one image 
(de Berg, Cheong et al. 2006; Litwinowicz and Williams 1994; Buck, Finkelstein et al. 2000) 
and to use an affine motion model (Table 3.3), specified by the three triangle vertices, inside 
each triangle. If the destination image is triangulated according to the new vertex locations, 
an inverse warping algorithm (Figure 3.46) can be used. If the source image is triangulated 
and used as a texture map, computer graphics rendering algorithms can be used to draw the 
new image (but care must be taken along triangle edges to avoid potential aliasing). 
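
The per-triangle affine model can be applied without forming the affine matrix explicitly: a point's barycentric coordinates in one triangle, applied to the vertices of the corresponding triangle, give the same mapping. The sketch below does this for inverse warping (destination point to source location). The function names are hypothetical, and locating which triangle contains a pixel is left out.

```python
def barycentric(p, a, b, c):
    """Barycentric weights (wa, wb, wc) of point p in triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    if det == 0:
        raise ValueError("degenerate triangle")
    wb = ((px - ax) * (cy - ay) - (cx - ax) * (py - ay)) / det
    wc = ((bx - ax) * (py - ay) - (px - ax) * (by - ay)) / det
    return 1.0 - wb - wc, wb, wc

def inverse_map(p_dst, tri_dst, tri_src):
    """Map a destination point to its source location using the affine
    transform defined by three corresponding triangle vertices."""
    wa, wb, wc = barycentric(p_dst, *tri_dst)
    (ax, ay), (bx, by), (cx, cy) = tri_src
    return (wa * ax + wb * bx + wc * cx,
            wa * ay + wb * by + wc * cy)
```

A point lies inside `tri_dst` exactly when all three weights are non-negative, which also gives a simple test for assigning pixels to triangles.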

Alternative methods for interpolating a sparse set of displacements include moving nearby 
quadrilateral mesh vertices, as shown in Figure 3.49a, using variational (energy minimizing) 
interpolants such as regularization (Litwinowicz and Williams 1994), see Section 4.2, or using 
locally weighted (radial basis function) combinations of displacements (Section 4.1.1). (See 
Section 4.1 for additional scattered data interpolation techniques.) If quadrilateral meshes are 
used, it may be desirable to interpolate displacements down to individual pixel values using 
a smooth interpolant such as a quadratic B-spline (Farin 2002; Lee, Wolberg et al. 1996).

In some cases, e.g., if a dense depth map has been estimated for an image (Shade, Gortler 
et al. 1998), we only know the forward displacement for each pixel. As mentioned before, 
drawing source pixels at their destination location, i.e., forward warping (Figure 3.45), suffers 
from several potential problems, including aliasing and the appearance of small cracks. An 
alternative technique in this case is to forward warp the displacement field (or depth map) 
to its new location, fill small holes in the resulting map, and then use inverse warping to 
perform the resampling (Shade, Gortler et al. 1998). The reason that this generally works 
better than forward warping is that displacement fields tend to be much smoother than images, 
so the aliasing introduced during the forward warping of the displacement field is much less 
noticeable. 
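
A one-dimensional toy version of this two-step procedure is sketched below. It is not the algorithm of Shade, Gortler et al. (1998), only an illustration of the idea: displacements are splatted to their rounded destinations (the last write wins on collisions, whereas a depth-based method would keep the nearest surface), holes are filled from the nearest valid neighbor, and the image is then resampled by inverse warping with linear interpolation. All function names are assumptions.

```python
import math

def forward_warp_displacements(disp):
    """Move each displacement d(x) to its destination x + d(x) (rounded)."""
    warped = [None] * len(disp)
    for x, d in enumerate(disp):
        xd = int(round(x + d))
        if 0 <= xd < len(disp):
            warped[xd] = d            # last write wins on collisions
    return warped

def fill_holes(warped):
    """Fill None entries from the nearest valid neighbor to the left,
    then fill any leading holes from the right."""
    filled = list(warped)
    last = None
    for i in range(len(filled)):
        if filled[i] is None:
            filled[i] = last
        else:
            last = filled[i]
    last = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = last
        else:
            last = filled[i]
    return filled

def inverse_warp(src, disp_at_dst):
    """Resample src at x' - d(x') for every destination x' (linear interp).
    Assumes disp_at_dst has no remaining holes."""
    n = len(src)
    out = []
    for xd, d in enumerate(disp_at_dst):
        xs = min(max(xd - d, 0.0), n - 1)      # clamp to the image
        i = int(math.floor(xs))
        j = min(i + 1, n - 1)
        t = xs - i
        out.append((1.0 - t) * src[i] + t * src[j])
    return out
```

Chaining the three functions, `inverse_warp(src, fill_holes(forward_warp_displacements(disp)))`, gives the warped signal.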

¹⁷ See Section 6.2.4 on active appearance models for more sophisticated examples of changing
facial expression and appearance.

    For each pixel X in the destination
        DSUM = (0, 0)
        weightsum = 0
        For each line P_i Q_i
            calculate u, v based on P_i Q_i
            calculate X'_i based on u, v and P_i' Q_i'
            calculate displacement D_i = X_i' - X_i for this line
            dist = shortest distance from X to P_i Q_i
            weight = (length^p / (a + dist))^b
            DSUM += D_i * weight
            weightsum += weight
        X' = X + DSUM / weightsum
        destinationImage(X) = sourceImage(X')

    (b)

Figure 3.50 Line-based image warping (Beier and Neely 1992) © 1992 ACM: (a) distance
computation and position transfer; (b) rendering algorithm; (c) two intermediate warps used
for morphing.

A second approach to specifying displacements for local deformations is to use corresponding
oriented line segments (Beier and Neely 1992), as shown in Figures 3.49c and 3.50.
Pixels along each line segment are transferred from source to destination exactly as specified,
and other pixels are warped using a smooth interpolation of these displacements. Each line
segment correspondence specifies a translation, rotation, and scaling, i.e., a similarity transform
(Table 3.3), for pixels in its vicinity, as shown in Figure 3.50a. Line segments influence
the overall displacement of the image using a weighting function that depends on the minimum
distance to the line segment (v in Figure 3.50a if u ∈ [0, 1], else the shorter of the two
distances to P and Q).
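
The rendering algorithm of Figure 3.50b can be sketched in Python for a single destination pixel as follows. The default values of the weighting constants `a`, `b`, and `p` are placeholders chosen for this example, the function name is hypothetical, and zero-length line segments are not handled.

```python
import math

def beier_neely_map(X, dst_lines, src_lines, a=1.0, b=2.0, p=0.5):
    """Return the source location X' for destination pixel X.

    dst_lines and src_lines are equal-length lists of ((Px, Py), (Qx, Qy))
    segment pairs in the destination and source images, respectively."""
    dsum_x = dsum_y = weight_sum = 0.0
    for (P, Q), (Ps, Qs) in zip(dst_lines, src_lines):
        d = (Q[0] - P[0], Q[1] - P[1])            # destination segment vector
        ds = (Qs[0] - Ps[0], Qs[1] - Ps[1])       # source segment vector
        length, length_s = math.hypot(*d), math.hypot(*ds)
        xp = (X[0] - P[0], X[1] - P[1])
        # u: fractional position along PQ; v: signed perpendicular distance.
        u = (xp[0] * d[0] + xp[1] * d[1]) / length ** 2
        v = (xp[0] * -d[1] + xp[1] * d[0]) / length
        # Same (u, v) relative to the source segment P'Q'.
        Xs = (Ps[0] + u * ds[0] + v * -ds[1] / length_s,
              Ps[1] + u * ds[1] + v * ds[0] / length_s)
        if u < 0.0:
            dist = math.hypot(X[0] - P[0], X[1] - P[1])
        elif u > 1.0:
            dist = math.hypot(X[0] - Q[0], X[1] - Q[1])
        else:
            dist = abs(v)
        weight = (length ** p / (a + dist)) ** b
        dsum_x += (Xs[0] - X[0]) * weight
        dsum_y += (Xs[1] - X[1]) * weight
        weight_sum += weight
    return (X[0] + dsum_x / weight_sum, X[1] + dsum_y / weight_sum)
```

With a single line pair, the mapping reduces to the similarity transform that takes the destination segment to the source segment; with several pairs, each pixel receives a weighted average of the individual displacements.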

One final possibility for specifying displacement fields is to use a mesh specifically 
adapted to the underlying image content, as shown in Figure 3.49d. Specifying such meshes 
by hand can involve a fair amount of work; Gomes, Darsa et al. (1999) describe an interactive 
system for doing this. Once the two meshes have been specified, intermediate warps can be 
generated using linear interpolation and the displacements at mesh nodes can be interpolated 


using splines. 


3.6.3 Application: Feature-based morphing 


While warps can be used to change the appearance of or to animate a single image, even 
more powerful effects can be obtained by warping and blending two or more images using 
a process now commonly known as morphing (Beier and Neely 1992; Lee, Wolberg et al. 
1996; Gomes, Darsa et al. 1999). 

Figure 3.51 shows the essence of image morphing. Instead of simply cross-dissolving 
between two images, which leads to ghosting as shown in the top row, each image is warped 
toward the other image before blending, as shown in the bottom row. If the correspondences 
have been set up well (using any of the techniques shown in Figure 3.49), corresponding 
features are aligned and no ghosting results. 

Figure 3.51 Image morphing (Gomes, Darsa et al. 1999) © 1999 Morgan Kaufmann. Top
row: if the two images are just blended, visible ghosting results. Bottom row: both images
are first warped to the same intermediate location (e.g., halfway towards the other image)
and the resulting warped images are then blended resulting in a seamless morph.

The above process is repeated for each intermediate frame being generated during a
morph, using different blends (and amounts of deformation) at each interval. Let t ∈ [0, 1] be
the time parameter that describes the sequence of interpolated frames. The weighting functions
for the two warped images in the blend are (1 - t) and t and the movements of the
pixels specified by the correspondences are also linearly interpolated. Some care must be
taken in defining what it means to partially warp an image towards a destination, especially
if the desired motion is far from linear (Sederberg, Gao et al. 1993). Exercise 3.25 has you
implement a morphing algorithm and test it out under such challenging conditions.
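
The structure of one morph frame can be sketched as below, assuming point correspondences and grayscale images stored as lists of rows. The `warp` argument is a hypothetical callback, `warp(image, from_points, to_points)`, standing in for any of the warping techniques of Section 3.6.2; only the linear interpolation of correspondences and the (1 - t), t cross-dissolve are shown.

```python
def interpolate_points(pts0, pts1, t):
    """Linearly interpolate corresponding points for time t in [0, 1]."""
    return [((1.0 - t) * x0 + t * x1, (1.0 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(pts0, pts1)]

def cross_dissolve(img0, img1, t):
    """Blend two equally sized images with weights (1 - t) and t."""
    return [[(1.0 - t) * a + t * b for a, b in zip(row0, row1)]
            for row0, row1 in zip(img0, img1)]

def morph_frame(img0, img1, pts0, pts1, t, warp):
    """Warp both images to the intermediate geometry, then blend them.

    warp(image, from_points, to_points) is a hypothetical user-supplied
    function that deforms `image` so that from_points move to to_points."""
    pts_t = interpolate_points(pts0, pts1, t)
    warped0 = warp(img0, pts0, pts_t)
    warped1 = warp(img1, pts1, pts_t)
    return cross_dissolve(warped0, warped1, t)
```

Passing an identity function for `warp` reproduces the plain cross-dissolve, which is a convenient way to see the ghosting that the warping step is meant to remove.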


3.7 Additional reading 


If you are interested in exploring the topic of image processing in more depth, some popular 
textbooks have been written by Gomes and Velho (1997), Jähne (1997), Pratt (2007), Burger
and Burge (2009), and Gonzalez and Woods (2017). The pre-eminent conference and jour- 
nal in this field are the IEEE International Conference on Image Processing and the IEEE 
Transactions on Image Processing. 

For image compositing operators, the seminal reference is by Porter and Duff (1984)
while Blinn (1994a,b) provides a more detailed tutorial. For image matting, Smith and
Blinn (1996) were the first to bring this topic to the attention of the graphics community,
while Wang and Cohen (2009) provide a good in-depth survey.

In the realm of linear filtering, Freeman and Adelson (1991) provide a great introduction
to separable and steerable oriented band-pass filters, while Perona (1995) shows how to
approximate any filter as a sum of separable components.

The literature on non-linear filtering is quite wide and varied; it includes such topics as
bilateral filtering (Tomasi and Manduchi 1998; Durand and Dorsey 2002; Chen, Paris, and
Durand 2007; Paris and Durand 2009; Paris, Kornprobst et al. 2008), related iterative algorithms
(Saint-Marc, Chen, and Medioni 1991; Nielsen, Florack, and Deriche 1997; Black, Sapiro 
et al. 1998; Weickert, ter Haar Romeny, and Viergever 1998; Weickert 1998; Barash 2002; 
Scharr, Black, and Haussecker 2003; Barash and Comaniciu 2004) and variational approaches 
(Chan, Osher, and Shen 2001; Tschumperlé and Deriche 2005; Tschumperlé 2006; Kaftory, 
Schechner, and Zeevi 2007), and guided filtering (Eisemann and Durand 2004; Petschnigg, 
Agrawala et al. 2004; He, Sun, and Tang 2013). 

Good references to image morphology include Haralick and Shapiro (1992, Section 5.2),
Bovik (2000, Section 2.2), Ritter and Wilson (2000, Section 7), Serra (1982), Serra and
Vincent (1992), Yuille, Vincent, and Geiger (1992), and Soille (2006).

The classic papers for image pyramids and pyramid blending are by Burt and Adelson 
(1983a,b). Wavelets were first introduced to the computer vision community by Mallat (1989) 
and good tutorial and review papers and books are available (Strang 1989; Simoncelli and 
Adelson 1990b; Rioul and Vetterli 1991; Chui 1992; Meyer 1993; Sweldens 1997). Wavelets 
are widely used in the computer graphics community to perform multi-resolution geometric 
processing (Stollnitz, DeRose, and Salesin 1996) and have been used in computer vision 
for similar applications (Szeliski 1990b; Pentland 1994; Gortler and Cohen 1995; Yaou and 
Chang 1994; Lai and Vemuri 1997; Szeliski 2006b; Krishnan and Szeliski 2011; Krishnan, 
Fattal, and Szeliski 2013), as well as for multi-scale oriented filtering (Simoncelli, Freeman 
et al. 1992) and denoising (Portilla, Strela et al. 2003). 

While image pyramids (Section 3.5.3) are usually constructed using linear filtering op- 
erators, more recent work uses non-linear filters, since these can better preserve details and 
other salient features. Some representative papers in the computer vision literature are by 
Gluckman (2006a,b); Lyu and Simoncelli (2008) and in computational photography by Bae, 
Paris, and Durand (2006), Farbman, Fattal et al. (2008), and Fattal (2009). 

High-quality algorithms for image warping and resampling are covered both in the image 
processing literature (Wolberg 1990; Dodgson 1992; Gomes, Darsa et al. 1999; Szeliski, 
Winder, and Uyttendaele 2010) and in computer graphics (Williams 1983; Heckbert 1986; 
Barkans 1997; Weinhaus and Devarajan 1997; Akenine-Möller and Haines 2002), where they
go under the name of texture mapping. Combinations of image warping and image blending
techniques are used to enable morphing between images, which is covered in a series of
seminal papers and books (Beier and Neely 1992; Gomes, Darsa et al. 1999).


3.8 Exercises


Ex 3.1: Color balance. Write a simple application to change the color balance of an image 
by multiplying each color value by a different user-specified constant. If you want to get 


fancy, you can make this application interactive, with sliders. 


1. Do you get different results if you take out the gamma transformation before or after
doing the multiplication? Why or why not?


2. Take the same picture with your digital camera using different color balance settings 
(most cameras control the color balance from one of the menus). Can you recover what 
the color balance ratios are between the different settings? You may need to put your 
camera on a tripod and align the images manually or automatically to make this work. 
Alternatively, use a color checker chart (Figure 10.3b), as discussed in Sections 2.3 and 
10.1.1. 


3. Can you think of any reason why you might want to perform a color twist (Sec- 


tion 3.1.2) on the images? See also Exercise 2.8 for some related ideas. 


Ex 3.2: Demosaicing. If you have access to the RAW image for the camera, perform the 
demosaicing yourself (Section 10.3.1). If not, just subsample an RGB image in a Bayer 
mosaic pattern. Instead of just bilinear interpolation, try one of the more advanced techniques 
described in Section 10.3.1. Compare your result to the one produced by the camera. Does 
your camera perform a simple linear mapping between RAW values and the color-balanced 
values in a JPEG? Some high-end cameras have a RAW+JPEG mode, which makes this 


comparison much easier. 


Ex 3.3: Compositing and reflections. Section 3.1.3 describes the process of compositing 
an alpha-matted image on top of another. Answer the following questions and optionally 


validate them experimentally: 


1. Most captured images have gamma correction applied to them. Does this invalidate the 


basic compositing equation (3.8); if so, how should it be fixed? 


2. The additive (pure reflection) model may have limitations. What happens if the glass is 
tinted, especially to a non-gray hue? How about if the glass is dirty or smudged? How 
could you model wavy glass or other kinds of refractive objects?


Ex 3.4: Blue screen matting. Set up a blue or green background, e.g., by buying a large
piece of colored posterboard. Take a picture of the empty background, and then of the back- 
ground with a new object in front of it. Pull the matte using the difference between each 
colored pixel and its assumed corresponding background pixel, using one of the techniques 
described in Section 3.1.3 or by Smith and Blinn (1996). 


Ex 3.5: Difference keying. Implement a difference keying algorithm (see Section 3.1.3) 
(Toyama, Krumm et al. 1999), consisting of the following steps:


1. Compute the mean and variance (or median and robust variance) at each pixel in an 


“empty” video sequence. 


2. For each new frame, classify each pixel as foreground or background (set the back- 
ground pixels to RGBA=0). 


3. (Optional) Compute the alpha channel and composite over a new background. 


4. (Optional) Clean up the image using morphology (Section 3.3.1), label the connected 
components (Section 3.3.3), compute their centroids, and track them from frame to 


frame. Use this to build a “people counter”. 


Ex 3.6: Photo effects. Write a variety of photo enhancement or effects filters: contrast, 
solarization (quantization), etc. Which ones are useful (perform sensible corrections) and 


which ones are more creative (create unusual images)? 


Ex 3.7: Histogram equalization. Compute the gray level (luminance) histogram for an im- 
age and equalize it so that the tones look better (and the image is less sensitive to exposure 
settings). You may want to use the following steps: 


1. Convert the color image to luminance (Section 3.1.2). 


2. Compute the histogram, the cumulative distribution, and the compensation transfer 
function (Section 3.1.4). 


3. (Optional) Try to increase the “punch” in the image by ensuring that a certain fraction 


of pixels (say, 5%) are mapped to pure black and white. 


4. (Optional) Limit the local gain f'(I) in the transfer function. One way to do this is to
limit f(I) < γI or f'(I) < γ while performing the accumulation (3.9), keeping any
unaccumulated values “in reserve”. (I’ll let you figure out the exact details.)


5. Compensate the luminance channel through the lookup table and re-generate the color
image using color ratios (2.117).


6. (Optional) Color values that are clipped in the original image, i.e., have one or more 
saturated color channels, may appear unnatural when remapped to a non-clipped value. 
Extend your algorithm to handle this case in some useful way. 


Ex 3.8: Local histogram equalization. Compute the gray level (luminance) histograms for 
each patch, but add to vertices based on distance (a spline). 


1. Build on Exercise 3.7 (luminance computation). 

2. Distribute values (counts) to adjacent vertices (bilinear). 
3. Convert to CDF (look-up functions). 

4. (Optional) Use low-pass filtering of CDFs. 


5. Interpolate adjacent CDFs for final lookup. 


Ex 3.9: Padding for neighborhood operations. Write down the formulas for computing
the padded pixel values $\tilde{f}(i, j)$ as a function of the original pixel values f(k, l) and the image
width and height (M, N) for each of the padding modes shown in Figure 3.13. For example,
for replication (clamping),

$$\tilde{f}(i, j) = f(k, l), \quad k = \max(0, \min(M - 1, i)), \quad l = \max(0, \min(N - 1, j)),$$

(Hint: you may want to use the min, max, mod, and absolute value operators in addition to 
the regular arithmetic operators.) 


• Describe in more detail the advantages and disadvantages of these various modes.

• (Optional) Check what your graphics card does by drawing a texture-mapped rectangle
where the texture coordinates lie beyond the [0.0, 1.0] range and using different texture
clamping modes.


Ex 3.10: Separable filters. Implement convolution with a separable kernel. The input should 
be a grayscale or color image along with the horizontal and vertical kernels. Make sure you 
support the padding mechanisms developed in the previous exercise. You will need this func- 
tionality for some of the later exercises. If you already have access to separable filtering in an 


image processing package you are using (such as IPL), skip this exercise.

• (Optional) Use Pietro Perona’s (1995) technique to approximate convolution as a sum
of a number of separable kernels. Let the user specify the number of kernels and report
back some sensible metric of the approximation fidelity.


Ex 3.11: Discrete Gaussian filters. Discuss the following issues with implementing a dis- 
crete Gaussian filter: 


• If you just sample the equation of a continuous Gaussian filter at discrete locations,
will you get the desired properties, e.g., will the coefficients sum up to 1? Similarly, if
you sample a derivative of a Gaussian, do the samples sum up to 0 or have vanishing
higher-order moments?

• Would it be preferable to take the original signal, interpolate it with a sinc, blur with a
continuous Gaussian, then prefilter with a sinc before re-sampling? Is there a simpler
way to do this in the frequency domain?

• Would it make more sense to produce a Gaussian frequency response in the Fourier
domain and to then take an inverse FFT to obtain a discrete filter?

• How does truncation of the filter change its frequency response? Does it introduce any
additional artifacts?

• Are the resulting two-dimensional filters as rotationally invariant as their continuous
analogs? Is there some way to improve this? In fact, can any 2D discrete (separable or
non-separable) filter be truly rotationally invariant?


Ex 3.12: Sharpening, blur, and noise removal. Implement some softening, sharpening, and 
non-linear diffusion (selective sharpening or noise removal) filters, such as Gaussian, median, 
and bilateral (Section 3.3.1), as discussed in Section 3.4.2. 

Take blurry or noisy images (shooting in low light is a good way to get both) and try to 


improve their appearance and legibility. 


Ex 3.13: Steerable filters. Implement Freeman and Adelson’s (1991) steerable filter algorithm.
The input should be a grayscale or color image and the output should be a multi-banded
image consisting of $G_1^{0^\circ}$ and $G_1^{90^\circ}$. The coefficients for the filters can be found in the paper
by Freeman and Adelson (1991).

Test the various order filters on a number of images of your choice and see if you can 
reliably find corner and intersection features. These filters will be quite useful later to detect 


elongated structures, such as lines (Section 7.4). 


184 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 




Figure 3.52 Sample images for testing the quality of resampling algorithms: (a) a synthetic 
chirp; (b) and (c) some high-frequency images from the image compression community. 


Ex 3.14: Bilateral and guided image filters. Implement or download code for bilateral and/or 
guided image filtering and use this to implement some image enhancement or processing
application, such as those described in Section 3.3.2.


Ex 3.15: Fourier transform. Prove the properties of the Fourier transform listed in Szeliski
(2010, Table 3.1) and derive the formulas for the Fourier transform pairs listed in Szeliski
(2010, Table 3.2) and Table 3.1. These exercises are very useful if you want to become
comfortable working with Fourier transforms, which is a very useful skill when analyzing and
designing the behavior and efficiency of many computer vision algorithms.


Ex 3.16: High-quality image resampling. Implement several of the low-pass filters pre- 
sented in Section 3.5.2 and also the windowed sinc shown in Figure 3.28. Feel free to imple- 
ment other filters (Wolberg 1990; Unser 1999). 

Apply your filters to continuously resize an image, both magnifying (interpolating) and 
minifying (decimating) it; compare the resulting animations for several filters. Use both a 
synthetic chirp image (Figure 3.52a) and natural images with lots of high-frequency detail 
(Figure 3.52b—c). 

You may find it helpful to write a simple visualization program that continuously plays the 
animations for two or more filters at once and that lets you “blink” between different results.

Discuss the merits and deficiencies of each filter, as well as the tradeoff between speed 
and quality. 


Ex 3.17: Pyramids. Construct an image pyramid. The inputs should be a grayscale or 
color image, a separable filter kernel, and the number of desired levels. Implement at least 
the following kernels:

• 2 × 2 block filtering;
• Burt and Adelson’s binomial kernel 1/16(1, 4, 6, 4, 1) (Burt and Adelson 1983a);
• a high-quality seven- or nine-tap filter.

Compare the visual quality of the various decimation filters. Also, shift your input image by
1 to 4 pixels and compare the resulting decimated (quarter size) image sequence.


Ex 3.18: Pyramid blending. Write a program that takes as input two color images and a 


binary mask image and produces the Laplacian pyramid blend of the two images. 
1. Construct the Laplacian pyramid for each image. 


2. Construct the Gaussian pyramid for the two mask images (the input image and its 


complement). 


3. Multiply each Laplacian image by its corresponding mask and sum the images (see 
Figure 3.41). 


4. Reconstruct the final image from the blended Laplacian pyramid. 


Generalize your algorithm to input n images and a label image with values 1...n (the value
0 can be reserved for “no input”). Discuss whether the weighted summation stage (step 3)
needs to keep track of the total weight for renormalization, or whether the math just works
out. Use your algorithm either to blend two differently exposed images (to avoid under- and
over-exposed regions) or to make a creative blend of two different scenes.


Ex 3.19: Pyramid blending in PyTorch. Re-write your pyramid blending exercise in Py- 
Torch. 


1. PyTorch has support for all of the primitives you need, i.e., fixed size convolutions 
(make sure they filter each channel separately), downsampling, upsampling, and addi- 
tion, subtraction, and multiplication (although the latter is rarely used). 


2. The goal of this exercise is not to train the convolution weights, but just to become 


familiar with the DNN primitives available in PyTorch. 


3. Compare your results to the ones using a standard Python or C++ computer vision 


library. They should be identical. 


4. Discuss whether you like this API better or worse for these kinds of fixed pipeline 


imaging tasks.


Ex 3.20: Local Laplacian—challenging. Implement the local Laplacian contrast manipulation
technique (Paris, Hasinoff, and Kautz 2011; Aubry, Paris et al. 2014) and use this to
implement edge-preserving filtering and tone manipulation.


Ex 3.21: Wavelet construction and applications. Implement one of the wavelet families 
described in Section 3.5.4 or by Simoncelli and Adelson (1990b), as well as the basic Lapla- 
cian pyramid (Exercise 3.17). Apply the resulting representations to one of the following two 
tasks: 


• Compression: Compute the entropy in each band for the different wavelet implementations,
assuming a given quantization level (say, ¼ gray level, to keep the rounding
error acceptable). Quantize the wavelet coefficients and reconstruct the original images.
Which technique performs better? (See Simoncelli and Adelson (1990b) or any
of the multitude of wavelet compression papers for some typical results.)

• Denoising: After computing the wavelets, suppress small values using coring, i.e., set
small values to zero using a piecewise linear or other C⁰ function. Compare the results
of your denoising using different wavelet and pyramid representations.


Ex 3.22: Parametric image warping. Write the code to do affine and perspective image 
warps (optionally bilinear as well). Try a variety of interpolants and report on their visual 


quality. In particular, discuss the following: 


• In a MIP-map, selecting only the coarser level adjacent to the computed fractional
level will produce a blurrier image, while selecting the finer level will lead to aliasing.
Explain why this is so and discuss whether blending an aliased and a blurred image
(tri-linear MIP-mapping) is a good idea.

• When the ratio of the horizontal and vertical resampling rates becomes very different
(anisotropic), the MIP-map performs even worse. Suggest some approaches to reduce
such problems.


Ex 3.23: Local image warping. Open an image and deform its appearance in one of the 
following ways: 


1. Click on a number of pixels and move (drag) them to new locations. Interpolate the
resulting sparse displacement field to obtain a dense motion field (Sections 3.6.2 and
3.5.1).


2. Draw a number of lines in the image. Move the endpoints of the lines to specify their
new positions and use the Beier-Neely interpolation algorithm (Beier and Neely 1992),
discussed in Section 3.6.2, to get a dense motion field.

3. Overlay a spline control grid and move one grid point at a time (optionally select the
level of the deformation).

4. Have a dense per-pixel flow field and use a soft “paintbrush” to design a horizontal and
vertical velocity field.

5. (Optional): Prove whether the Beier-Neely warp does or does not reduce to a sparse
point-based deformation as the line segments become shorter (reduce to points).


Ex 3.24: Forward warping. Given a displacement field from the previous exercise, write
a forward warping algorithm:


1. Write a forward warper using splatting, either nearest neighbor or soft accumulation 
(Section 3.6.1). 


2. Write a two-pass algorithm that forward warps the displacement field, fills in small 
holes, and then uses inverse warping (Shade, Gortler et al. 1998). 


3. Compare the quality of these two algorithms. 


Ex 3.25: Feature-based morphing. Extend the warping code you wrote in Exercise 3.23
to import two different images and specify correspondences (point, line, or mesh-based)
between the two images.


1. Create a morph by partially warping the images towards each other and cross-dissolving 
(Section 3.6.3). 


2. Try using your morphing algorithm to perform an image rotation and discuss whether
it behaves the way you want it to.


Ex 3.26: 2D image editor. Extend the program you wrote in Exercise 2.2 to import images
and let you create a “collage” of pictures. You should implement the following steps:
1. Open up a new image (in a separate window). 
2. Shift drag (rubber-band) to crop a subregion (or select whole image). 
3. Paste into the current canvas. 


4. Select the deformation mode (motion model): translation, rigid, similarity, affine, or
perspective.


5. Drag any corner of the outline to change its transformation. 

Figure 3.53 There is a faint image of a rainbow visible in the right-hand side of this picture.
Can you think of a way to enhance it (Exercise 3.29)? 


6. (Optional) Change the relative ordering of the images and which image is currently
being manipulated.

The user should see the composition of the various images' pieces on top of each other.

This exercise should be built on the image transformation classes supported in the soft- 
ware library. Persistence of the created representation (save and load) should also be sup- 
ported (for each image, save its transformation). 


Ex 3.27: 3D texture-mapped viewer. Extend the viewer you created in Exercise 2.3 to include
texture-mapped polygon rendering. Augment each polygon with (u, v, w) coordinates
into an image.


Ex 3.28: Image denoising. Implement at least two of the various image denoising tech- 
niques described in this chapter and compare them on both synthetically noised image se- 
quences and real-world (low-light) sequences. Does the performance of the algorithm de- 
pend on the correct choice of noise level estimate? Can you draw any conclusions as to 
which techniques work better? 


Ex 3.29: Rainbow enhancer (challenging). Take a picture containing a rainbow, such as
Figure 3.53, and enhance the strength (saturation) of the rainbow.


1. Draw an arc in the image delineating the extent of the rainbow. 


2. Fit an additive rainbow function (explain why it is additive) to this arc (it is best to work
with linearized pixel values), using the spectrum as the cross-section, and estimating
the width of the arc and the amount of color being added. This is the trickiest part of
the problem, as you need to tease apart the (low-frequency) rainbow pattern and the
natural image hiding behind it.


3. Amplify the rainbow signal and add it back into the image, re-applying the gamma
function if necessary to produce the final image.

Figure 4.1 Examples of data interpolation and global optimization: (a) scattered data
interpolation (curve fitting) (Bishop 2006) © 2006 Springer; (b) graphical model interpretation
of first-order regularization; (c) colorization using optimization (Levin, Lischinski, and Weiss
2004) © 2004 ACM; (d) multi-image photomontage formulated as an unordered label MRF
(Agarwala, Dontcheva et al. 2004) © 2004 ACM.

In the previous chapter, we covered a large number of image processing operators that
take as input one or more images and produce some filtered or transformed version of these 
images. In many situations, however, we are given incomplete data as input, such as depths at 
a sparse number of locations, or user scribbles suggesting how an image should be colorized 
or segmented (Figure 4.1c-d). 


The problem of interpolating a complete image (or more generally a function or field) 
from incomplete or varying quality data is often called scattered data interpolation. We 
begin this chapter with a review of techniques in this area, since in addition to being widely 
used in computer vision, they also form the basis of most machine learning algorithms, which
we will study in the next chapter.


Instead of doing an exhaustive survey, we present in Section 4.1 some easy-to-use techniques,
such as triangulation, spline interpolation, and radial basis functions. While these
techniques are widely used, they cannot easily be modified to provide controlled continuity,
i.e., to produce the kinds of piecewise continuous reconstructions we expect when estimating
depth maps, label maps, or even color images.


For this reason, we introduce in Section 4.2 variational methods, which formulate the
interpolation problem as the recovery of a piecewise smooth function subject to exact or
approximate data constraints. Because the smoothness is controlled using penalties formulated
as norms of the function, this class of techniques is often called regularization or
energy-based approaches. To find the minimum-energy solutions to these problems, we discretize
them (typically on a pixel grid), resulting in a discrete energy, which can then be minimized
using sparse linear systems or related iterative techniques.


In the last part of this chapter, Section 4.3, we show how such energy-based formulations 
are related to Bayesian inference techniques formulated as Markov random fields, which are a 
special case of general probabilistic graphical models. In these formulations, data constraints 
can be interpreted as noisy and/or incomplete measurements, and piecewise smoothness con- 
straints as prior assumptions or models over the solution space. Such formulations are also 
often called generative models, since we can, in principle, generate random samples from 
the prior distribution to see if they conform with our expectations. Because the prior models 
can be more complex than simple smoothness constraints, and because the solution space
can have multiple local minima, more sophisticated optimization techniques have been
developed, which we discuss in this section.

4.1 Scattered data interpolation


The goal of scattered data interpolation is to produce a (usually continuous and smooth)
function $\mathbf{f}(\mathbf{x})$ that passes through a set of data points $\mathbf{d}_k$ placed at locations $\mathbf{x}_k$ such that

$$\mathbf{f}(\mathbf{x}_k) = \mathbf{d}_k. \tag{4.1}$$


The related problem of scattered data approximation only requires the function to pass near
the data points (Amidror 2002; Wendland 2004; Anjyo, Lewis, and Pighin 2014). This is
usually formulated using a penalty function such as

$$E_D = \sum_k \|\mathbf{f}(\mathbf{x}_k) - \mathbf{d}_k\|^2, \tag{4.2}$$


with the squared norm in the above formula sometimes replaced by a different norm or robust 
function (Section 4.1.3). In statistics and machine learning, the problem of predicting an 
output function given a finite number of samples is called regression (Section 5.1). The x 
vectors are called the inputs and the outputs y are called the targets. Figure 4.1a shows 
an example of one-dimensional scattered data interpolation, while Figures 4.2 and 4.8 show 
some two-dimensional examples. 

At first glance, scattered data interpolation seems closely related to image interpolation, 
which we studied in Section 3.5.1. However, unlike images, which are regularly gridded, the 
data points in scattered data interpolation are irregularly placed throughout the domain, as 
shown in Figure 4.2. This requires some adjustments to the interpolation methods we use. 

If the domain x is two-dimensional, as is the case with images, one simple approach is to
triangulate the domain x using the data locations $\mathbf{x}_k$ as the triangle vertices. The resulting
triangular network, shown in Figure 4.2a, is called a triangular irregular network (TIN), 
and was one of the early techniques used to produce elevation maps from scattered field 
measurements collected by surveys. 

The triangulation in Figure 4.2a was produced using a Delaunay triangulation, which is
the most widely used planar triangulation technique due to its attractive computational
properties, such as the avoidance of long skinny triangles. Algorithms for efficiently computing
such triangulations are readily available¹ and covered in textbooks on computational geometry
(Preparata and Shamos 1985; de Berg, Cheong et al. 2008). The Delaunay triangulation can
be extended to higher-dimensional domains using the property of circumscribing spheres, i.e.,
the requirement that all selected simplices (triangles, tetrahedra, etc.) have no other vertices
inside their circumscribing spheres.

¹ For example, https://docs.scipy.org/doc/scipy/reference/tutorial/spatial.html

Figure 4.2 Some simple scattered data interpolation and approximation algorithms: (a)
a Delaunay triangulation defined over a set of data point locations; (b) data structure and 
intermediate results for the pull-push algorithm (Gortler, Grzeszczuk et al. 1996) © 1996 
ACM. 


Once the triangulation has been defined, it is straightforward to define a piecewise-linear 
interpolant over each triangle, resulting in an interpolant that is $C^0$ but not generally $C^1$
continuous. The formulas for the function inside each triangle are usually derived using 
barycentric coordinates, which attain their maximal values at the vertices and sum up to one 
(Farin 2002; Amidror 2002). 
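
As a rough illustration of the piecewise-linear interpolant, the following Python sketch uses SciPy's Delaunay triangulation together with the barycentric transforms it stores for each triangle. The function name `tin_interpolate` and the choice to return NaN outside the convex hull are our own assumptions, not part of the text.

```python
import numpy as np
from scipy.spatial import Delaunay

def tin_interpolate(points, values, query):
    """Piecewise-linear interpolation over a 2D Delaunay triangulation.

    points: (m, 2) data locations, values: (m,) data values,
    query: (n, 2) evaluation locations. Returns (n,) interpolated values,
    with NaN for queries outside the convex hull (an assumed convention).
    """
    points = np.asarray(points, dtype=float)
    values = np.asarray(values, dtype=float)
    query = np.asarray(query, dtype=float)
    tri = Delaunay(points)
    simplex = tri.find_simplex(query)            # -1 outside the hull
    # tri.transform[s] stores the affine map to barycentric coordinates.
    T = tri.transform[simplex, :2]               # (n, 2, 2)
    r = query - tri.transform[simplex, 2]        # (n, 2)
    b = np.einsum("nij,nj->ni", T, r)            # first two coordinates
    bary = np.c_[b, 1.0 - b.sum(axis=1)]         # the three sum to one
    verts = tri.simplices[simplex]               # (n, 3) vertex indices
    out = (bary * values[verts]).sum(axis=1)
    out[simplex < 0] = np.nan
    return out
```

Because the interpolant is linear inside each triangle, it reproduces any globally linear function exactly, which makes for a convenient sanity check.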

If a smoother surface is desired as the interpolant, we can replace the piecewise linear 
functions on each triangle with higher-order splines, much as we did for image interpolation 
(Section 3.5.1). However, since these splines are now defined over irregular triangulations, 
more sophisticated techniques must be used (Farin 2002; Amidror 2002). Other, more recent 
interpolators based on geometric modeling techniques in computer graphics include subdivi- 
sion surfaces (Peters and Reif 2008). 

An alternative to triangulating the data points is to use a regular n-dimensional grid, as 
shown in Figure 4.2b. Splines defined on such domains are often called tensor product splines 
and have been used to interpolate scattered data (Lee, Wolberg, and Shin 1997). 

An even faster, but less accurate, approach is called the pull-push algorithm and was 
originally developed for interpolating missing 4D lightfield samples in a Lumigraph (Gortler, 
Grzeszczuk et al. 1996). The algorithm proceeds in three phases, as schematically illustrated 
in Figure 4.2b. 

First, the irregular data samples are splatted onto (i.e., spread across) the nearest grid 
vertices, using the same approach we discussed in Section 3.6.1 on parametric image trans- 
formations. The splatting operations accumulate both values and weights at nearby vertices. 
In the second, pull, phase, values and weights are computed at a hierarchical set of lower
resolution grids by combining the coefficient values from the higher resolution grids. In the lower
resolution grids, the gaps (regions where the weights are low) become smaller. In the third,
push, phase, information from each lower resolution grid is combined with the next higher 
resolution grid, filling in the gaps while not unduly blurring the higher resolution information 
already computed. Details of these three stages can be found in (Gortler, Grzeszczuk et al. 
1996). 

The pull-push algorithm is very fast, since it is essentially linear in the number of input
data points and fine-level grid samples.
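
The pull and push phases can be sketched in a few lines of Python. This is a deliberately simplified variant and not the exact scheme of Gortler, Grzeszczuk et al. (1996): it assumes the samples have already been splatted onto a square grid whose side is a power of two, uses 2x2 block averaging in the pull phase, and uses nearest-neighbor upsampling in the push phase. The name `pull_push` is hypothetical.

```python
import numpy as np

def pull_push(values, weights):
    """Fill low-weight gaps in a square grid (side length a power of two).

    values, weights: (n, n) arrays produced by splatting; weights are
    clipped to [0, 1], where 1 means "fully known".
    """
    w = np.clip(weights, 0.0, 1.0)
    n = values.shape[0]
    if n == 1:
        return values.astype(float)
    # Pull: combine each 2x2 block into one weighted coarser sample.
    blocks_w = w.reshape(n // 2, 2, n // 2, 2)
    blocks_v = (w * values).reshape(n // 2, 2, n // 2, 2)
    w_sum = blocks_w.sum(axis=(1, 3))
    coarse_v = blocks_v.sum(axis=(1, 3)) / np.maximum(w_sum, 1e-12)
    coarse_w = np.minimum(w_sum, 1.0)
    # Recurse so that gaps in the coarse grid are filled first.
    coarse_v = pull_push(coarse_v, coarse_w)
    # Push: upsample and blend in only where the fine weights are low.
    up = np.repeat(np.repeat(coarse_v, 2, axis=0), 2, axis=1)
    return w * values + (1.0 - w) * up
```

Each level touches a quarter as many samples as the one before it, so the total work stays proportional to the number of fine-level grid samples.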


4.1.1 Radial basis functions 


While the mesh-based representations we have just described can provide good-quality
interpolants, they are typically limited to low-dimensional domains, because the size of the mesh
grows combinatorially with the dimensionality of the domain. In higher dimensions, it is
common to use mesh-free approaches that define the desired interpolant as a weighted sum
of basis functions, similar to the formulation used in image interpolation (3.64). In machine
learning, such approaches are often called kernel functions or kernel regression (Bishop 2006,
Chapter 6; Murphy 2012, Chapter 14; Schölkopf and Smola 2001).

In more detail, the interpolated function f is a weighted sum (or superposition) of basis
functions centered at each input data point

$$\mathbf{f}(\mathbf{x}) = \sum_k \mathbf{w}_k \phi(\|\mathbf{x} - \mathbf{x}_k\|), \tag{4.3}$$

where the $\mathbf{x}_k$ are the locations of the scattered data points, the $\phi$s are the radial basis functions
(or kernels), and $\mathbf{w}_k$ are the local weights associated with each kernel. The basis functions
$\phi()$ are called radial because they are applied to the radial distance between a data sample
$\mathbf{x}_k$ and an evaluation point $\mathbf{x}$. The choice of $\phi$ determines the smoothness properties of the
interpolant, while the choice of weights $\mathbf{w}_k$ determines how closely the function approximates
the input.


Some commonly used basis functions (Anjyo, Lewis, and Pighin 2014) include 


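$$\text{Gaussian:} \quad \phi(r) = \exp(-r^2/c^2) \tag{4.4}$$
$$\text{Hardy multiquadric:} \quad \phi(r) = \sqrt{r^2 + c^2} \tag{4.5}$$
$$\text{Inverse multiquadric:} \quad \phi(r) = 1/\sqrt{r^2 + c^2} \tag{4.6}$$
$$\text{Thin plate spline:} \quad \phi(r) = r^2 \log r. \tag{4.7}$$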


In these equations, r is the radial distance and c is a scale parameter that controls the size
(radial falloff) of the basis functions, and hence its smoothness (more compact bases lead to
“peakier” solutions). The thin plate spline equation holds for two dimensions (the general
n-dimensional spline is called the polyharmonic spline and is given in (Anjyo, Lewis, and
Pighin 2014)) and is the analytic solution to the second degree variational spline derived in
(4.19).

If we want our function to exactly interpolate the data values, we solve the linear system
of equations (4.1), i.e.,

$$\mathbf{f}(\mathbf{x}_k) = \sum_l \mathbf{w}_l \phi(\|\mathbf{x}_k - \mathbf{x}_l\|) = \mathbf{d}_k, \tag{4.8}$$

to obtain the desired set of weights $\mathbf{w}_k$. Note that for large amounts of basis function overlap
(large values of c), these equations may be quite ill-conditioned, i.e., small changes in data
values or locations can result in large changes in the interpolated function. Note also that the
solution of such a system of equations is in general $O(m^3)$, where m is the number of data
points (unless we use basis functions with finite extent to obtain a sparse set of equations).
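
To make the linear system (4.8) concrete, here is a minimal NumPy sketch that builds the dense matrix of pairwise basis function values and solves for the weights. The Gaussian kernel (4.4) is used, the data values are assumed to be scalars, and the function names `rbf_fit` and `rbf_eval` are hypothetical.

```python
import numpy as np

def gaussian(r, c):
    """Gaussian radial basis function (4.4)."""
    return np.exp(-(r / c) ** 2)

def rbf_fit(x, d, c=1.0):
    """Solve the dense m x m system (4.8) for the weights.

    x: (m, n) data locations, d: (m,) scalar data values.
    """
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return np.linalg.solve(gaussian(r, c), d)   # O(m^3) dense solve

def rbf_eval(x_query, x, w, c=1.0):
    """Evaluate the interpolant (4.3) at the query locations."""
    r = np.linalg.norm(x_query[:, None, :] - x[None, :, :], axis=-1)
    return gaussian(r, c) @ w
```

Evaluating the fitted function back at the data locations should return the data values up to round-off. Increasing c makes the matrix more nearly singular, which is the ill-conditioning mentioned above.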

A more prudent approach is to solve the regularized data approximation problem, which
involves minimizing the data constraint energy (4.2) together with a weight penalty
(regularizer) of the form

$$E_W = \sum_k \|\mathbf{w}_k\|^p, \tag{4.9}$$

and to then minimize the regularized least squares problem

$$E(\{\mathbf{w}_k\}) = E_D + \lambda E_W \tag{4.10}$$
$$= \sum_k \Big\| \sum_l \mathbf{w}_l \phi(\|\mathbf{x}_k - \mathbf{x}_l\|) - \mathbf{d}_k \Big\|^2 + \lambda \sum_k \|\mathbf{w}_k\|^p. \tag{4.11}$$

When p = 2 (quadratic weight penalty), the resulting energy is a pure least squares problem,
and can be solved using the normal equations (Appendix A.2), where the λ value gets added
along the diagonal to stabilize the system of equations.
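
For the quadratic case, the normal equations take only a few lines. The sketch below assumes scalar data values and a Gaussian kernel; writing Φ for the matrix of pairwise basis function values, it solves (ΦᵀΦ + λI) w = Φᵀd, which is the stationarity condition of (4.11) with p = 2. The function name `ridge_rbf_fit` is hypothetical.

```python
import numpy as np

def ridge_rbf_fit(x, d, lam, c=1.0):
    """Regularized RBF weights for the p = 2 case of (4.11).

    x: (m, n) data locations, d: (m,) scalar data values,
    lam: regularization strength lambda. Returns (weights, Phi).
    """
    r = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    phi = np.exp(-(r / c) ** 2)                  # Gaussian kernel matrix
    # lam is added along the diagonal of the normal equations.
    A = phi.T @ phi + lam * np.eye(len(x))
    return np.linalg.solve(A, phi.T @ d), phi
```

Larger values of λ pull the weights towards zero and trade exact interpolation for a better conditioned system.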

In statistics and machine learning, the quadratic (regularized least squares) problem is 
called ridge regression. In neural networks, adding a quadratic penalty on the weights is 
called weight decay, because it encourages weights to decay towards zero (Section 5.3.3). 
When p = 1, the technique is called lasso (least absolute shrinkage and selection operator),
since for sufficiently large values of λ, many of the weights $\mathbf{w}_k$ get driven to zero (Tibshirani
1996; Bishop 2006; Murphy 2012; Deisenroth, Faisal, and Ong 2020). This results in a 
sparse set of basis functions being used in the interpolant, which can greatly speed up the 
computation of new values of f(x). We will have more to say on sparse kernel techniques in 
the section on Support Vector Machines (Section 5.1.4). 

An alternative to solving a set of equations to determine the weights $\mathbf{w}_k$ is to simply set
them to the input data values $\mathbf{d}_k$. However, this fails to interpolate the data, and instead
produces higher values in higher density regions. This can be useful if we are trying to
estimate a probability density function from a set of samples. In this case, the resulting
density function, obtained after normalizing the sum of sample-weighted basis functions to
have a unit integral, is called the Parzen window or kernel approach to probability density
estimation (Duda, Hart, and Stork 2001, Section 4.3; Bishop 2006, Section 2.5.1). Such
probability densities can be used, among other things, for (spatially) clustering color values
together for image segmentation in what is known as the mean shift approach (Comaniciu
and Meer 2002) (Section 7.5.2).
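
A one-dimensional Parzen window estimate can be sketched as follows. We assume equal (unit) sample weights and a Gaussian kernel of width h, normalized so that each kernel integrates to one; the name `parzen_density` and the parameter `h` are our own.

```python
import numpy as np

def parzen_density(x_query, samples, h):
    """1D Parzen window density estimate with Gaussian kernels.

    x_query: (n,) evaluation points, samples: (m,) observed samples,
    h: kernel width (standard deviation of each Gaussian).
    """
    z = (x_query[:, None] - samples[None, :]) / h
    # Each kernel has unit integral, so their mean does too.
    kernels = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)
```

The estimate is highest where the samples cluster, which is the property that mode-seeking methods such as mean shift exploit.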

If, instead of just estimating a density, we wish to actually interpolate a set of data values
$\mathbf{d}_k$, we can use a related technique known as kernel regression or the Nadaraya-Watson
model, in which we divide the data-weighted summed basis functions by the sum of all the
basis functions,

$$\mathbf{f}(\mathbf{x}) = \frac{\sum_k \mathbf{d}_k \phi(\|\mathbf{x} - \mathbf{x}_k\|)}{\sum_l \phi(\|\mathbf{x} - \mathbf{x}_l\|)}. \tag{4.12}$$

Note how this operation is similar, in concept, to the splatting method for forward rendering
we discussed in Section 3.6.1, except that here, the bases can be much wider than the
nearest-neighbor bilinear bases used in graphics (Takeda, Farsiu, and Milanfar 2007).
Kernel regression is equivalent to creating a new set of spatially varying normalized
shifted basis functions

$$\phi'_k(\mathbf{x}) = \frac{\phi(\|\mathbf{x} - \mathbf{x}_k\|)}{\sum_l \phi(\|\mathbf{x} - \mathbf{x}_l\|)}, \tag{4.13}$$

which form a partition of unity, i.e., sum up to 1 at every location (Anjyo, Lewis, and Pighin
2014). While the resulting interpolant can now be written more succinctly as

$$\mathbf{f}(\mathbf{x}) = \sum_k \mathbf{d}_k \phi'_k(\mathbf{x}), \tag{4.14}$$

in most cases, it is more expensive to precompute and store the K $\phi'_k$ functions than to
evaluate (4.12).
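
The following sketch evaluates (4.12) directly with NumPy and also returns the normalized bases of (4.13) so that the partition of unity can be checked. It assumes scalar data values and a Gaussian kernel, and the name `kernel_regression` is hypothetical.

```python
import numpy as np

def kernel_regression(x_query, x, d, c=1.0):
    """Nadaraya-Watson kernel regression, equation (4.12).

    x_query: (n, dim) evaluation points, x: (m, dim) data locations,
    d: (m,) scalar data values. Returns (f, phi_norm), where phi_norm
    holds the normalized bases of (4.13), one row per query point.
    """
    r = np.linalg.norm(x_query[:, None, :] - x[None, :, :], axis=-1)
    phi = np.exp(-(r / c) ** 2)
    # Queries far from every sample would divide by ~0; not handled here.
    phi_norm = phi / phi.sum(axis=1, keepdims=True)
    return phi_norm @ d, phi_norm
```

Since every output is a convex combination of the data values, the result always stays within their range and, unlike the exact interpolant of (4.8), never overshoots.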

While not that widely used in computer vision, kernel regression techniques have been 
applied by Takeda, Farsiu, and Milanfar (2007) to a number of low-level image process- 
ing operations, including state-of-the-art handheld multi-frame super-resolution (Wronski, 
Garcia-Dorado et al. 2019). 

One last scattered data interpolation technique worth mentioning is moving least squares,
where a weighted subset of nearby points is used to compute a local smooth surface. Such
techniques are most widely used in 3D computer graphics, especially for point-based surface
modeling, as discussed in Section 13.4 and (Alexa, Behr et al. 2003; Pauly, Keiser et al.
2003; Anjyo, Lewis, and Pighin 2014).

Figure 4.3 Polynomial curve fitting to the blue circles, which are noisy samples from the
green sine curve (Bishop 2006) © 2006 Springer. The four plots show the 0th order constant
function, the first order linear fit, the M = 3 cubic polynomial, and the 9th degree polynomial.
Notice how the first two curves exhibit underfitting, while the last curve exhibits overfitting,
i.e., excessive wiggle.


4.1.2 Overfitting and underfitting 


When we introduced weight regularization in (4.9), we said that it was usually preferable to 
approximate the data but we did not explain why. In most data fitting problems, the samples
$\mathbf{d}_k$ (and sometimes even their locations $\mathbf{x}_k$) are noisy, so that fitting them exactly makes no
sense. In fact, doing so can introduce a lot of spurious wiggles, when the true solution is 
likely to be smoother. 

To delve into this phenomenon, let us start with a simple polynomial fitting example 
taken from (Bishop 2006, Chapter 1.1). Figure 4.3 shows a number of polynomial curves of 
different orders M fit to the blue circles, which are noisy samples from the underlying green 
sine curve. Notice how the low-order (M = 0 and M = 1) polynomials severely underfit 
the underlying data, resulting in curves that are too flat, while the M = 9 polynomial, which 
exactly fits the data, exhibits far more wiggle than is likely. 

How can we quantify this amount of underfitting and overfitting, and how can we get just
the right amount? This topic is widely studied in machine learning and covered in a number of
texts, including Bishop (2006, Chapter 1.1), Glassner (2018, Chapter 9), Deisenroth, Faisal,
and Ong (2020, Chapter 8), and Zhang, Lipton et al. (2021, Section 4.4.3).

200 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021)

Figure 4.4 Regularized M = 9 polynomial fitting for two different values of λ (Bishop
2006) © 2006 Springer. The left plot shows a reasonable amount of regularization, resulting
in a plausible fit, while the larger value of λ on the right causes underfitting.

[Plot: idealized error curves. Vertical axis: error. Regions: underfitting, overfitting. Curves: validation error, training error.]

Figure 4.5 Fitting (training) and validation errors as a function of the amount of
regularization or smoothing © Glassner (2018). The less regularized solutions on the right, while
exhibiting lower fitting error, perform less well on the validation data.


One approach is to use regularized least squares, introduced in (4.11). Figure 4.4 shows
an M = 9th degree polynomial fit obtained by minimizing (4.11) with the polynomial basis
functions $\phi_k(x) = x^k$ for two different values of λ. The left plot shows a reasonable amount
of regularization, resulting in a plausible fit, while the larger value of λ on the right causes
underfitting. Note that the M = 9 interpolant shown in the lower right quadrant of Figure 4.3
corresponds to the unregularized λ = 0 case.
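
A small experiment in this spirit is easy to set up. The sketch below fits a polynomial with a quadratic weight penalty by solving an augmented least squares problem, which minimizes the same energy as the normal equations but is numerically gentler for the ill-conditioned monomial basis. The function names and the augmented formulation are our own choices, not taken from Bishop (2006).

```python
import numpy as np

def fit_polynomial(x, d, M, lam=0.0):
    """Minimize sum_k (sum_j w_j x_k^j - d_k)^2 + lam * sum_j w_j^2."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns x^0 .. x^M
    # Stacking sqrt(lam) * I under Phi adds lam * ||w||^2 to the energy.
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(M + 1)])
    b = np.concatenate([d, np.zeros(M + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def eval_polynomial(w, x):
    """Evaluate the fitted polynomial at the locations x."""
    return np.vander(x, len(w), increasing=True) @ w
```

With ten samples, M = 9, and λ = 0, the fit passes through every sample. Any positive λ shrinks the coefficients and leaves a nonzero fitting error.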


If we were to now measure the difference between the red (estimated) and green (noise-free)
curves, we see that choosing a good intermediate value of λ will produce the best result.
In practice, however, we never have access to samples from the noise-free data.

[Plot panel label: ln λ = 2.6]

Figure 4.6 The more heavily regularized solution log λ = 2.6 exhibits higher bias (deviation
from original curve) than the less heavily regularized version (log λ = -2.4), which has
much higher variance (Bishop 2006) © 2006 Springer. The red curves on the left are M = 24
Gaussian basis fits to 25 randomly sampled points on the green curve. The red curve on the
right is their mean.


Instead, if we are given a set of samples to interpolate, we can save some in a validation 
set in order to see if the function we compute is underfitting or overfitting. When we vary a 
parameter such as λ (or use some other measure to control smoothness), we typically obtain
a curve such as the one shown in Figure 4.5. In this figure, the blue curve denotes the fitting 
error, which in this case is called the training error, since in machine learning, we usually 
split the given data into a (typically larger) training set and a (typically smaller) validation 
set. 


To obtain an even better estimate of the ideal amount of regularization, we can repeat the
process of splitting our sample data into training and validation sets several times. One
well-known technique, called cross-validation (Craven and Wahba 1979; Wahba and Wendelberger
1980; Bishop 2006, Section 1.3; Murphy 2012, Section 1.4.8; Deisenroth, Faisal, and Ong
2020, Chapter 8; Zhang, Lipton et al. 2021, Section 4.4.2), splits the training data into K
folds (equal sized pieces). You then put aside each fold, in turn, and train on the remaining
data. You can then estimate the best regularization parameter by averaging the held-out
errors over all K training runs. While this generally works well (K = 5 is often used), it may
be too expensive when training large neural networks because of the long training times involved.
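
As a minimal sketch of K-fold cross-validation for a regularized least squares fit, the code below scores each candidate λ by its mean squared error on the held-out folds. It assumes a precomputed design matrix `Phi` (one row of basis function values per sample) and scalar targets; the names `ridge_fit` and `cross_validate` are hypothetical.

```python
import numpy as np

def ridge_fit(Phi, d, lam):
    """Solve the regularized normal equations for the weights."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ d)

def cross_validate(Phi, d, lambdas, K=5, seed=0):
    """Return the mean held-out squared error for each candidate lambda."""
    idx = np.random.default_rng(seed).permutation(len(d))
    folds = np.array_split(idx, K)               # K roughly equal pieces
    errors = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for held_out in folds:
            train = np.setdiff1d(idx, held_out)  # all remaining samples
            w = ridge_fit(Phi[train], d[train], lam)
            residual = Phi[held_out] @ w - d[held_out]
            errors[i] += np.mean(residual ** 2) / K
    return errors
```

The selected value is then `lambdas[np.argmin(errors)]`. The cost is K fits per candidate, which is the expense referred to above.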

Cross-validation is just one example of a class of model selection techniques that estimate 
hyperparameters in a training algorithm to achieve good performance. Additional methods 
include information criteria such as the Bayesian information criterion (BIC) (Torr 2002) and 
the Akaike information criterion (AIC) (Kanatani 1998), and Bayesian modeling approaches 
(Szeliski 1989; Bishop 2006; Murphy 2012). 

One last topic worth mentioning with regard to data fitting, since it comes up often in
discussions of statistical machine learning techniques, is the bias-variance tradeoff (Bishop 2006,
Section 3.2). As you can see in Figure 4.6, using a large amount of regularization (top row)
results in much lower variance between different random sample solutions, but much higher
bias away from the true solution. Using insufficient regularization increases the variance
dramatically, although an average over a large number of samples has low bias. The trick is to
determine a reasonable compromise in terms of regularization so that any individual solution
has a good expectation of being close to the ground truth (original clean continuous) data.
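
The tradeoff can be stated more precisely. As a sketch in notation that is our own assumption (it is not used elsewhere in this chapter), let $\hat{f}(x)$ be the fit obtained from one random sample set, $h(x)$ the noise-free function, and $\mathbb{E}[\cdot]$ the average over sample sets. The expected squared error at a location x then splits into two terms:

```latex
\mathbb{E}\big[(\hat{f}(x) - h(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - h(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
```

The cross term vanishes because $\hat{f}(x) - \mathbb{E}[\hat{f}(x)]$ has zero mean. Heavy regularization shrinks the variance term at the cost of the bias term, and light regularization does the opposite.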


4.1.3 Robust data fitting 


When we added a regularizer on the weights in (4.9), we noted that it did not have to be a 
quadratic penalty and could, instead, be a lower-order monomial that encouraged sparsity in 
the weights. 

This same idea can be applied to data terms such as (4.2), where, instead of using a
quadratic penalty, we can use a robust loss function $\rho()$,

$$E_R = \sum_k \rho(\|\mathbf{r}_k\|), \quad \text{with} \quad \mathbf{r}_k = \mathbf{f}(\mathbf{x}_k) - \mathbf{d}_k, \tag{4.15}$$

which gives lower weights to larger data fitting errors, which are more likely to be outlier
measurements. (The fitting error term $\mathbf{r}_k$ is called the residual error.)

Some examples of loss functions from (Barron 2019) are shown in Figure 4.7 along with
their derivatives. The regular quadratic (α = 2) penalty gives full (linear) weight to each
error, whereas the α = 1 loss gives equal weight to all larger residuals, i.e., it behaves as
an $L_1$ loss for large residuals, and $L_2$ for small ones. Even smaller values of α discount large
errors (outliers) even more, although they result in optimization problems that are non-convex,
i.e., that can have multiple local minima. We will discuss techniques for finding good initial
guesses for such problems later on in Section 8.1.4.

[Plot panel titles: $\rho(x, \alpha, c)$ (left) and $\partial\rho/\partial x\,(x, \alpha, c)$ (right)]

Figure 4.7 A general and adaptive loss function (left) and its gradient (right) for different
values of its shape parameter α (Barron 2019) © 2019 IEEE. Several values of α reproduce
existing loss functions: $L_2$ loss (α = 2), Charbonnier loss (α = 1), Cauchy loss (α = 0),
Geman-McClure loss (α = -2), and Welsch loss (α = -∞).


In statistics, minimizing non-quadratic loss functions to deal with potential outlier mea- 
surements is known as M-estimation (Huber 1981; Hampel, Ronchetti et al. 1986; Black 
and Rangarajan 1996; Stewart 1999). Such estimation problems are often solved using it- 
eratively reweighted least squares, which we discuss in more detail in Section 8.1.4 and 
Appendix B.3. The Appendix also discusses the relationship between robust statistics and 
non-Gaussian probabilistic models. 

The generalized loss function introduced by Barron (2019) has two free parameters. The
first one, α, controls how drastically outlier residuals are downweighted. The second (scale)
parameter c controls the width of the quadratic well near the minimum, i.e., what range of
residual values roughly corresponds to inliers. Traditionally, the choice of α, which
corresponds to a variety of previously published loss functions, was determined heuristically,
based on the expected shape of the outlier distribution and computational considerations (e.g.,
whether a convex loss was desired). The scale parameter c could be estimated using a robust
measure of variance, as discussed in Appendix B.3.
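
To see how the two parameters interact, the loss can be written out in a few lines. The closed form below is transcribed from Barron (2019) rather than from this chapter, so treat it as an assumption to check against the paper; the function name `general_loss` is our own. The limits α = 2, α = 0, and α = -∞ are handled as special cases, since the general expression is singular there.

```python
import numpy as np

def general_loss(x, alpha, c):
    """Barron's general robust loss rho(x, alpha, c), as we recall it.

    x: residual(s), alpha: shape parameter, c: scale parameter (> 0).
    """
    z = (np.asarray(x, dtype=float) / c) ** 2
    if alpha == 2:                       # quadratic (L2) loss
        return 0.5 * z
    if alpha == 0:                       # Cauchy loss
        return np.log1p(0.5 * z)
    if alpha == -np.inf:                 # Welsch loss
        return 1.0 - np.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)
```

For a fixed large residual, the loss value drops as α decreases, which is how outliers end up contributing less to the total energy.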

In his paper, Barron (2019) discusses how both parameters can be determined at run time
by maximizing the likelihood (or equivalently, minimizing the negative log-likelihood) of the
given residuals, making such an algorithm self-tuning to a wide variety of noise levels and
outlier distributions.

Figure 4.8 A simple surface interpolation problem: (a) nine data points of various heights
scattered on a grid; (b) second-order, controlled-continuity, thin-plate spline interpolator,
with a tear along its left edge and a crease along its right (Szeliski 1989) © 1989 Springer.


4.2 Variational methods and regularization 


The theory of regularization we introduced in the previous section was first developed by
statisticians trying to fit models to data that severely underconstrained the solution space
(Tikhonov and Arsenin 1977; Engl, Hanke, and Neubauer 1996). Consider, for example,
finding a smooth surface that passes through (or near) a set of measured data points
(Figure 4.8). Such a problem is described as ill-posed because many possible surfaces can fit this
data. Since small changes in the input can sometimes lead to large changes in the fit (e.g.,
if we use polynomial interpolation), such problems are also often ill-conditioned. Since we
are trying to recover the unknown function f(x, y) from which the data points $d(x_i, y_i)$ were
sampled, such problems are also often called inverse problems. Many computer vision tasks
can be viewed as inverse problems, since we are trying to recover a full description of the 3D
world from a limited set of images.


In the previous section, we attacked this problem using basis functions placed at the data
points, or other heuristics such as the pull-push algorithm. While such techniques can provide
reasonable solutions, they do not let us directly quantify and hence optimize the amount of
smoothness in the solution, nor do they give us local control over where the solution should
be discontinuous (Figure 4.8).


To do this, we use norms (measures) on function derivatives (described below) to formulate
the problem and then find minimal energy solutions to these norms. Such techniques
are often called energy-based or optimization-based approaches to computer vision. They
are also often called variational, since we can use the calculus of variations to find the optimal
solutions. Variational methods have been widely used in computer vision since the early
1980s to pose and solve a number of fundamental problems, including optical flow (Horn
and Schunck 1981; Black and Anandan 1993; Brox, Bruhn et al. 2004; Werlberger, Pock,
and Bischof 2010), segmentation (Kass, Witkin, and Terzopoulos 1988; Mumford and Shah
1989; Chan and Vese 2001), denoising (Rudin, Osher, and Fatemi 1992; Chan, Osher, and
Shen 2001; Chan and Shen 2005), and multi-view stereo (Faugeras and Keriven 1998; Pons,
Keriven, and Faugeras 2007; Kolev, Klodt et al. 2009). A more detailed list of relevant papers
can be found in the Additional Reading section at the end of this chapter.

In order to quantify what it means to find a smooth solution, we can define a norm on 
the solution space. For one-dimensional functions f(x), we can integrate the squared first 
derivative of the function, 


$$\mathcal{E}_1 = \int f_x^2(x)\, dx \tag{4.16}$$

or perhaps integrate the squared second derivative, 

$$\mathcal{E}_2 = \int f_{xx}^2(x)\, dx. \tag{4.17}$$


(Here, we use subscripts to denote differentiation.) Such energy measures are examples of 
functionals, which are operators that map functions to scalar values. They are also often called 
variational methods, because they measure the variation (non-smoothness) in a function. 
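
As a rough numerical illustration, the one-dimensional functionals (4.16) and (4.17) can be approximated with finite differences. This is only a sketch: the helper names, the test function f(x) = x^2, and the grid size are assumptions chosen for the example, not part of the original text. On [0, 1] the exact values are 4/3 for the first-derivative energy and 4 for the second-derivative energy.

```python
def membrane_energy(f, h):
    """Approximate (4.16): squared first differences, scaled by 1/h."""
    return sum((f[i + 1] - f[i]) ** 2 for i in range(len(f) - 1)) / h


def thin_plate_energy(f, h):
    """Approximate (4.17): squared second differences, scaled by 1/h^3."""
    return sum((f[i + 1] - 2 * f[i] + f[i - 1]) ** 2
               for i in range(1, len(f) - 1)) / h ** 3


n = 100                                         # number of grid intervals (assumed)
h = 1.0 / n                                     # grid spacing
samples = [(i * h) ** 2 for i in range(n + 1)]  # f(x) = x^2 sampled on [0, 1]

e1 = membrane_energy(samples, h)    # approaches 4/3 as the grid is refined
e2 = thin_plate_energy(samples, h)  # approaches 4 as the grid is refined
```

The scale factors come from replacing each derivative by a difference quotient and the integral by a sum over cells of width h.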

In two dimensions (e.g., for images, flow fields, or surfaces), the corresponding smooth- 
ness functionals are 


$$\mathcal{E}_1 = \int f_x^2(x, y) + f_y^2(x, y)\, dx\, dy = \int \|\nabla f(x, y)\|^2\, dx\, dy \tag{4.18}$$

and 

$$\mathcal{E}_2 = \int f_{xx}^2(x, y) + 2 f_{xy}^2(x, y) + f_{yy}^2(x, y)\, dx\, dy, \tag{4.19}$$

where the mixed 2f_{xy}^2 term is needed to make the measure rotationally invariant (Grimson 
1983). 

The first derivative norm is often called the membrane, since interpolating a set of data 
points using this measure results in a tent-like structure. (In fact, this formula is a small- 
deflection approximation to the surface area, which is what soap bubbles minimize.) The 
second-order norm is called the thin-plate spline, since it approximates the behavior of thin 
plates (e.g., flexible steel) under small deformations. A blend of the two is called the thin- 
plate spline under tension (Terzopoulos 1986b). 

The regularizers (smoothness functions) we have just described force the solution to be 
smooth and C0 and/or C1 continuous everywhere. In most computer vision applications, 
however, the fields we are trying to model or recover are only piecewise continuous, e.g., 
depth maps and optical flow fields jump at object discontinuities. Color images are even 
more discontinuous, since they also change appearance at albedo (surface color) and shading 
discontinuities. 

To better model such functions, Terzopoulos (1986b) introduced controlled-continuity 
splines, where each derivative term is multiplied by a local weighting function, 

$$\mathcal{E}_{CC} = \int \rho(x, y) \{ [1 - \tau(x, y)] [f_x^2(x, y) + f_y^2(x, y)] + \tau(x, y) [f_{xx}^2(x, y) + 2 f_{xy}^2(x, y) + f_{yy}^2(x, y)] \}\, dx\, dy. \tag{4.20}$$

Here, ρ(x, y) ∈ [0, 1] controls the continuity of the surface and τ(x, y) ∈ [0, 1] controls the 
local tension, i.e., how flat the surface wants to be. Figure 4.8 shows a simple example of 
a controlled-continuity interpolator fit to nine scattered data points. In practice, it is more 
common to find first-order smoothness terms used with images and flow fields (Section 9.3) 
and second-order smoothness associated with surfaces (Section 13.3.1). 

In addition to the smoothness term, variational problems also require a data term (or data 
penalty). For scattered data interpolation (Nielson 1993), the data term measures the distance 
between the function f(x, y) and a set of data points d_i = d(x_i, y_i), 

$$\mathcal{E}_D = \sum_i [f(x_i, y_i) - d_i]^2. \tag{4.21}$$

For a problem like noise removal, a continuous version of this measure can be used, 

$$\mathcal{E}_D = \int [f(x, y) - d(x, y)]^2\, dx\, dy. \tag{4.22}$$

To obtain a global energy that can be minimized, the two energy terms are usually added 
together, 

$$\mathcal{E} = \mathcal{E}_D + \lambda \mathcal{E}_S, \tag{4.23}$$

where E_S is the smoothness penalty (E_1, E_2, or some weighted blend such as E_CC) and λ is 
the regularization parameter, which controls the smoothness of the solution. As we saw in 
Section 4.1.2, good values for the regularization parameter can be estimated using techniques 
such as cross-validation. 


4.2.1 Discrete energy minimization 


In order to find the minimum of this continuous problem, the function f(x, y) is usually first 
discretized on a regular grid.² The most principled way to perform this discretization is to use 
finite element analysis, i.e., to approximate the function with a piecewise continuous spline, 
and then perform the analytic integration (Bathe 2007). 

²The alternative of using kernel basis functions centered on the data points (Boult and Kender 1986; Nielson 
1993) is discussed in more detail in Section 13.3.1. 

Fortunately, for both the first-order and second-order smoothness functionals, the judicious 
selection of appropriate finite elements results in particularly simple discrete forms 
(Terzopoulos 1983). The corresponding discrete smoothness energy functions become 

$$\begin{aligned} \mathcal{E}_1 = \sum_{i,j}\; & s_x(i,j)\,[f(i+1,j) - f(i,j) - g_x(i,j)]^2 \\ & + s_y(i,j)\,[f(i,j+1) - f(i,j) - g_y(i,j)]^2 \end{aligned} \tag{4.24}$$

and 

$$\begin{aligned} \mathcal{E}_2 = h^{-2} \sum_{i,j}\; & c_x(i,j)\,[f(i+1,j) - 2f(i,j) + f(i-1,j)]^2 \\ & + 2c_m(i,j)\,[f(i+1,j+1) - f(i+1,j) - f(i,j+1) + f(i,j)]^2 \\ & + c_y(i,j)\,[f(i,j+1) - 2f(i,j) + f(i,j-1)]^2, \end{aligned} \tag{4.25}$$

where h is the size of the finite element grid. The h factor is only important if the energy is 
being discretized at a variety of resolutions, as in coarse-to-fine or multigrid techniques. 

The optional smoothness weights s_x(i, j) and s_y(i, j) control the location of horizontal 
and vertical tears (or weaknesses) in the surface. For other problems, such as colorization 
(Levin, Lischinski, and Weiss 2004) and interactive tone mapping (Lischinski, Farbman et 
al. 2006), they control the smoothness in the interpolated chroma or exposure field and are 
often set inversely proportional to the local luminance gradient strength. For second-order 
problems, the crease variables c_x(i, j), c_m(i, j), and c_y(i, j) control the locations of creases 
in the surface (Terzopoulos 1988; Szeliski 1990a). 

The data values g_x(i, j) and g_y(i, j) are gradient data terms (constraints) used by algorithms 
such as photometric stereo (Section 13.1.1), HDR tone mapping (Section 10.2.1) 
(Fattal, Lischinski, and Werman 2002), Poisson blending (Section 8.4.4) (Pérez, Gangnet, 
and Blake 2003), gradient-domain blending (Section 8.4.4) (Levin, Zomet et al. 2004), and 
Poisson surface reconstruction (Section 13.5.1) (Kazhdan, Bolitho, and Hoppe 2006; Kazh- 
dan and Hoppe 2013). They are set to zero when just discretizing the conventional first-order 
smoothness functional (4.18). Note how separate smoothness and curvature terms can be im- 
posed in the x, y, and mixed directions to produce local tears or creases (Terzopoulos 1988; 
Szeliski 1990a). 


The two-dimensional discrete data energy is written as 

$$\mathcal{E}_D = \sum_{i,j} c(i,j)\,[f(i,j) - d(i,j)]^2, \tag{4.26}$$

where the local confidence weights c(i, j) control how strongly the data constraint is enforced. 
These values are set to zero where there is no data and can be set to the inverse 
variance of the data measurements when there is data (as discussed by Szeliski (1989) and in 
Section 4.3). 


The total energy of the discretized problem can now be written as a quadratic form 

$$\mathcal{E} = \mathcal{E}_D + \lambda \mathcal{E}_S = \mathbf{x}^T \mathbf{A} \mathbf{x} - 2 \mathbf{x}^T \mathbf{b} + c, \tag{4.27}$$

where x = [f(0, 0) ... f(m − 1, n − 1)] is called the state vector.³ 

The sparse symmetric positive-definite matrix A is called the Hessian since it encodes the 
second derivative of the energy function.⁴ For the one-dimensional, first-order problem, A 
is tridiagonal; for the two-dimensional, first-order problem, it is multi-banded with five non-zero 
entries per row. We call b the weighted data vector. Minimizing the above quadratic 
form is equivalent to solving the sparse linear system 

$$\mathbf{A} \mathbf{x} = \mathbf{b}, \tag{4.28}$$


which can be done using a variety of sparse matrix techniques, such as multigrid (Briggs, 
Henson, and McCormick 2000) and hierarchical preconditioners (Szeliski 2006b; Krishnan 
and Szeliski 2011; Krishnan, Fattal, and Szeliski 2013), as described in Appendix A.5 and 
illustrated in Figure 4.11. Using such techniques is essential to obtaining reasonable run- 
times, since properly preconditioned sparse linear systems have convergence times that are 
linear in the number of pixels. 
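
To make the link between the discrete energy and the linear system (4.28) concrete, here is a minimal one-dimensional sketch. It assumes a first-order smoothness term with zero gradient constraints and uses a direct tridiagonal (Thomas) solve instead of the multigrid or preconditioned solvers cited above. The function names `solve_tridiagonal` and `regularize_1d` and the toy data are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] = A[i][i-1], upper[i] = A[i][i+1]."""
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def regularize_1d(d, c, s, lam):
    """Minimize sum_i c[i] (f[i] - d[i])^2 + lam * sum_i s[i] (f[i+1] - f[i])^2."""
    n = len(d)
    diag = [float(ci) for ci in c]       # data term contributes c[i] to A[i][i]
    lower = [0.0] * n
    upper = [0.0] * n
    for i in range(n - 1):
        w = lam * s[i]                   # weight of the link between i and i+1
        diag[i] += w
        diag[i + 1] += w
        upper[i] = -w
        lower[i + 1] = -w
    rhs = [c[i] * d[i] for i in range(n)]  # weighted data vector b
    return solve_tridiagonal(lower, diag, upper, rhs)


data = [0.0, 0.0, 0.0, 0.0, 1.0]
conf = [1.0, 0.0, 0.0, 0.0, 1.0]          # data only at the two end points
smooth = regularize_1d(data, conf, [1.0, 1.0, 1.0, 1.0], lam=1.0)
torn = regularize_1d(data, conf, [1.0, 0.0, 1.0, 1.0], lam=1.0)  # tear after sample 1
```

With uniform weights, the interior samples fall on a straight line between the two soft constraints. Setting one smoothness weight to zero introduces a tear, so each side collapses to the value of its own data point.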

While regularization was first introduced to the vision community by Poggio, Torre, and 
Koch (1985) and Terzopoulos (1986b) for problems such as surface interpolation, it was 
quickly adopted by other vision researchers for such varied problems as edge detection (Sec- 
tion 7.2), optical flow (Section 9.3), and shape from shading (Section 13.1) (Poggio, Torre, 
and Koch 1985; Horn and Brooks 1986; Terzopoulos 1986b; Bertero, Poggio, and Torre 1988; 
Brox, Bruhn et al. 2004). Poggio, Torre, and Koch (1985) also showed how the discrete energy 
defined by Equations (4.24–4.26) could be implemented in a resistive grid, as shown 
in Figure 4.9. In computational photography (Chapter 10), regularization and its variants are 
commonly used to solve problems such as high-dynamic range tone mapping (Fattal, Lischinski, 
and Werman 2002; Lischinski, Farbman et al. 2006), Poisson and gradient-domain blending 
(Pérez, Gangnet, and Blake 2003; Levin, Zomet et al. 2004; Agarwala, Dontcheva et al. 
2004), colorization (Levin, Lischinski, and Weiss 2004), and natural image matting (Levin, 
Lischinski, and Weiss 2008). 

³We use x instead of f because this is the more common form in the numerical analysis literature (Golub and 
Van Loan 1996). 
⁴In numerical analysis, A is called the coefficient matrix (Saad 2003); in finite element analysis (Bathe 2007), it 
is called the stiffness matrix. 

Figure 4.9 Graphical model interpretation of first-order regularization. The white circles 
are the unknowns f(i, j) while the dark circles are the input data d(i, j). In the resistive grid 
interpretation, the d and f values encode input and output voltages and the black squares 
denote resistors whose conductance is set to s_x(i, j), s_y(i, j), and c(i, j). In the spring-mass 
system analogy, the circles denote elevations and the black squares denote springs. The same 
graphical model can be used to depict a first-order Markov random field (Figure 4.12). 


Robust regularization 


While regularization is most commonly formulated using quadratic (L2) norms, i.e., the 
squared derivatives in (4.16–4.19) and squared differences in (4.24–4.25), it can also be formulated 
using the non-quadratic robust penalty functions first introduced in Section 4.1.3 and 
discussed in more detail in Appendix B.3. For example, (4.24) can be generalized to 

$$\begin{aligned} \mathcal{E}_{1R} = \sum_{i,j}\; & s_x(i,j)\,\rho(f(i+1,j) - f(i,j)) \\ & + s_y(i,j)\,\rho(f(i,j+1) - f(i,j)), \end{aligned} \tag{4.29}$$

where ρ(x) is some monotonically increasing penalty function. For example, the family of 
norms ρ(x) = |x|^p is called p-norms. When p < 2, the resulting smoothness terms become 
more piecewise continuous than totally smooth, which can better model the discontinuous 
nature of images, flow fields, and 3D surfaces. 

An early example of robust regularization is the graduated non-convexity (GNC) algorithm 
of Blake and Zisserman (1987). Here, the norms on the data and derivatives are 
clamped, 

$$\rho(x) = \min(x^2, V). \tag{4.30}$$

Because the resulting problem is highly non-convex (it has many local minima), a continuation 
method is proposed, where a quadratic norm (which is convex) is gradually replaced by 
the non-convex robust norm (Allgower and Georg 2003). (Around the same time, Terzopoulos 
(1988) was also using continuation to infer the tear and crease variables in his surface 
interpolation problems.) 


4.2.2 Total variation 


Today, many regularized problems are formulated using the L1 (p = 1) norm, which is often 
called total variation (Rudin, Osher, and Fatemi 1992; Chan, Osher, and Shen 2001; 
Chambolle 2004; Chan and Shen 2005; Tschumperlé and Deriche 2005; Tschumperlé 2006; 
Cremers 2007; Kaftory, Schechner, and Zeevi 2007; Kolev, Klodt et al. 2009; Werlberger, 
Pock, and Bischof 2010). The advantage of this norm is that it tends to better preserve dis- 
continuities, but still results in a convex problem that has a globally unique solution. Other 
norms, for which the influence (derivative) more quickly decays to zero, are presented by 
Black and Rangarajan (1996), Black, Sapiro et al. (1998), and Barron (2019) and discussed 
in Section 4.1.3 and Appendix B.3. 

Even more recently, hyper-Laplacian norms with p < 1 have gained popularity, based 
on the observation that the log-likelihood distribution of image derivatives follows a 
p = 0.5–0.8 slope and is therefore a hyper-Laplacian distribution (Simoncelli 1999; Levin and 
Weiss 2007; Weiss and Freeman 2007; Krishnan and Fergus 2009). Such norms have an even 
stronger tendency to prefer large discontinuities over small ones. See the related discussion 
in Section 4.3 (4.43). 
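
A small worked example (not from the original text) shows how the exponent p changes this preference. Compare the cost of one unit jump with the cost of splitting it into two half-size jumps under ρ(x) = |x|^p:

$$\rho(1) = 1, \qquad 2\,\rho(\tfrac{1}{2}) = 2 \cdot 2^{-p} = 2^{1-p}.$$

For p = 2 the split costs 0.5, so a quadratic penalty prefers to spread a step over several pixels. For p = 1 both cost 1, so total variation is indifferent. For p = 0.8 the split costs 2^{0.2} ≈ 1.149 and for p = 0.5 it costs √2 ≈ 1.414, so a hyper-Laplacian penalty favors the single large discontinuity.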

While least squares regularized problems using L2 norms can be solved using linear systems, 
other p-norms require different iterative techniques, such as iteratively reweighted least 
squares (IRLS), Levenberg–Marquardt, alternation between local non-linear subproblems 
and global quadratic regularization (Krishnan and Fergus 2009), or primal-dual algorithms 
(Chambolle and Pock 2011). Such techniques are discussed in Section 8.1.3 and Appendices 
A.3 and B.3. 
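
As an illustration of the first of these techniques, the following sketch applies IRLS to a one-dimensional denoising energy, sum_i (f_i − d_i)^2 + λ sum_i |f_{i+1} − f_i|^p. Each pass replaces the p-norm by a weighted quadratic whose weights (p/2)|Δ|^{p−2} come from the current estimate, then solves the resulting tridiagonal system. The function names, the clamping constant `eps`, and the step-edge test signal are assumptions made for this example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] = A[i][i-1], upper[i] = A[i][i+1]."""
    n = len(diag)
    c_prime, d_prime = [0.0] * n, [0.0] * n
    c_prime[0], d_prime[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def irls_denoise_1d(d, lam, p, iterations=30, eps=1e-6):
    """IRLS for sum (f - d)^2 + lam * sum |f[i+1] - f[i]|^p."""
    n = len(d)
    f = list(d)
    for _ in range(iterations):
        # Reweighting: w * delta^2 matches |delta|^p to first order at the
        # current estimate; eps avoids division by zero in flat regions.
        w = [0.5 * p / max(abs(f[i + 1] - f[i]), eps) ** (2 - p)
             for i in range(n - 1)]
        diag, lower, upper = [1.0] * n, [0.0] * n, [0.0] * n
        for i in range(n - 1):
            a = lam * w[i]
            diag[i] += a
            diag[i + 1] += a
            upper[i] = -a
            lower[i + 1] = -a
        f = solve_tridiagonal(lower, diag, upper, list(d))
    return f


step = [0.0] * 10 + [1.0] * 10                # ideal step edge
f_quad = irls_denoise_1d(step, lam=2.0, p=2)  # p = 2: ordinary least squares
f_tv = irls_denoise_1d(step, lam=2.0, p=1)    # p = 1: total variation
jump_quad = f_quad[10] - f_quad[9]
jump_tv = f_tv[10] - f_tv[9]
```

On this signal the quadratic penalty smears the edge over several samples, while the p = 1 penalty keeps most of the jump at a single location and only reduces its contrast.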


4.2.3 Bilateral solver 


In our discussion of variational methods, we have focused on energy minimization problems 
based on gradients and higher-order derivatives, which in the discrete setting involves 
evaluating weighted errors between neighboring pixels. As we saw previously in our discussion 
of bilateral filtering in Section 3.3.2, we can often get better results by looking at 
a larger spatial neighborhood and combining pixels with similar colors or grayscale values. 
To extend this idea to a variational (energy minimization) setting, Barron and Poole (2016) 
propose replacing the usual first-order nearest-neighbor smoothness penalty (4.24) with a 
wider-neighborhood, bilaterally weighted version 

$$\mathcal{E}_B = \sum_{i,j} \sum_{k,l} \hat{w}(i,j,k,l)\,[f(k,l) - f(i,j)]^2, \tag{4.31}$$

where 

$$\hat{w}(i,j,k,l) = \frac{w(i,j,k,l)}{\sum_{m,n} w(i,j,m,n)} \tag{4.32}$$

is the bistochastized (normalized) version of the bilateral weight function given in (3.37), 
which may depend on an input guide image, but not on the estimated values of f.⁵ 
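
The following one-dimensional sketch shows how such weights behave. It assumes a Gaussian bilateral weight in both the spatial and guide-value differences (the usual form of a bilateral weight function) and applies the normalization exactly as printed in (4.32), i.e., dividing each row by its sum; it does not attempt a doubly stochastic normalization or the bilateral-grid acceleration. The function names, the sigma values, and the toy guide signal are hypothetical.

```python
import math


def bilateral_weights_1d(guide, sigma_s, sigma_r):
    """w(i, k): Gaussian in spatial distance and in guide-value difference."""
    n = len(guide)
    return [[math.exp(-((i - k) ** 2) / (2.0 * sigma_s ** 2)
                      - ((guide[i] - guide[k]) ** 2) / (2.0 * sigma_r ** 2))
             for k in range(n)] for i in range(n)]


def normalize_rows(w):
    """Normalization as written in (4.32): divide each row by its sum."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total for v in row])
    return out


def bilateral_energy(f, w_hat):
    """Smoothness energy of the form (4.31) for a 1D signal f."""
    n = len(f)
    return sum(w_hat[i][k] * (f[k] - f[i]) ** 2
               for i in range(n) for k in range(n))


guide = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]      # guide signal with one edge
w_hat = normalize_rows(bilateral_weights_1d(guide, sigma_s=2.0, sigma_r=0.1))

aligned = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]    # solution edge matches the guide edge
shifted = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]    # solution edge is one sample off
e_aligned = bilateral_energy(aligned, w_hat)
e_shifted = bilateral_energy(shifted, w_hat)
```

Because the weights across the guide edge are nearly zero, a discontinuity in f that coincides with the guide edge is almost free, while a misaligned one is penalized.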

To efficiently solve the resulting set of equations (which are much denser than nearest- 
neighbor versions), the authors use the same approach originally used to accelerate bilateral 
filtering, i.e., solving a related problem on a (spatially coarser) bilateral grid. The sequence 
of operations resembles those used for bilateral filtering, except that after splatting and before 
slicing, an iterative least squares solver is used instead of a multi-dimensional Gaussian blur. 
To further speed up the conjugate gradient solver, Barron and Poole (2016) use a multi-level 
preconditioner inspired by previous work on image-adapted preconditioners (Szeliski 2006b; 
Krishnan, Fattal, and Szeliski 2013). 

Since its introduction, the bilateral solver has been used in a number of video process- 
ing and 3D reconstruction applications, including the stitching of binocular omnidirectional 
panoramic videos (Anderson, Gallup et al. 2016). The smartphone AR system developed 
by Valentin, Kowdle et al. (2018) extends the bilateral solver to have local planar models 
and uses a hardware-friendly real-time implementation (Mazumdar, Alaghi et al. 2017) to 
produce dense occlusion effects. 


4.2.4 Application: Interactive colorization 


A good use of edge-aware interpolation techniques is in colorization, i.e., manually adding 
colors to a “black and white” (grayscale) image. In most applications of colorization, the 
user draws some scribbles indicating the desired colors in certain regions (Figure 4.10a) and 
the system interpolates the specified chrominance (u, v) values to the whole image, which 
are then re-combined with the input luminance channel to produce a final colorized image, 
as shown in Figure 4.10b. In the system developed by Levin, Lischinski, and Weiss (2004), 
the interpolation is performed using locally weighted regularization (4.24), where the local 
smoothness weights are inversely proportional to luminance gradients. This approach 
to locally weighted regularization has inspired later algorithms for high dynamic range tone 
mapping (Lischinski, Farbman et al. 2006) (Section 10.2.1), as well as other applications of 
the weighted least squares (WLS) formulation (Farbman, Fattal et al. 2008). These techniques 
have benefitted greatly from image-adapted regularization techniques, such as those 
developed in Szeliski (2006b), Krishnan and Szeliski (2011), Krishnan, Fattal, and Szeliski 
(2013), and Barron and Poole (2016), as shown in Figure 4.11. An alternative approach to 
performing the sparse chrominance interpolation based on geodesic (edge-aware) distance 
functions has been developed by Yatziv and Sapiro (2006). Neural networks can also be used 
to implement deep priors for image colorization (Zhang, Zhu et al. 2017). 

⁵Note that in their paper, Barron and Poole (2016) use different σ_r values for the luminance and chrominance 
components of pixel color differences. 

Figure 4.10 Colorization using optimization (Levin, Lischinski, and Weiss 2004) © 2004 
ACM: (a) grayscale image with some color scribbles overlaid; (b) resulting colorized image; 
(c) original color image from which the grayscale image and the chrominance values for the 
scribbles were derived. Original photograph by Rotem Weiss. 


4.3 Markov random fields 


As we have just seen, regularization, which involves the minimization of energy functionals 
defined over (piecewise) continuous functions, can be used to formulate and solve a variety 
of low-level computer vision problems. An alternative technique is to formulate a Bayesian 
or generative model, which separately models the noisy image formation (measurement) process, 
as well as assuming a statistical prior model over the solution space (Bishop 2006, 
Section 1.5.4). In this section, we look at priors based on Markov random fields, whose 
log-likelihood can be described using local neighborhood interaction (or penalty) terms (Kindermann 
and Snell 1980; Geman and Geman 1984; Marroquin, Mitter, and Poggio 1987; Li 
1995; Szeliski, Zabih et al. 2008; Blake, Kohli, and Rother 2011). 

Figure 4.11 Speeding up the inhomogeneous least squares colorization solver using locally 
adapted hierarchical basis preconditioning (Szeliski 2006b) © 2006 ACM: (a) input gray 
image with color strokes overlaid; (b) solution after 20 iterations of conjugate gradient; (c) 
using one iteration of hierarchical basis function preconditioning; (d) using one iteration of 
locally adapted hierarchical basis functions. 

The use of Bayesian modeling has several potential advantages over regularization (see 
also Appendix B). The ability to model measurement processes statistically enables us to 
extract the maximum information possible from each measurement, rather than just guessing 
what weighting to give the data. Similarly, the parameters of the prior distribution can often 
be learned by observing samples from the class we are modeling (Roth and Black 2007a; 
Tappen 2007; Li and Huttenlocher 2008). Furthermore, because our model is probabilistic, 
it is possible to estimate (in principle) complete probability distributions over the unknowns 
being recovered and, in particular, to model the uncertainty in the solution, which can be 
useful in later processing stages. Finally, Markov random field models can be defined over 
discrete variables, such as image labels (where the variables have no proper ordering), for 
which regularization does not apply. 

According to Bayes’ rule (Appendix B.4), the posterior distribution p(x|y) over the unknowns 
x given the measurements y can be obtained by multiplying the measurement likelihood 
p(y|x) by the prior distribution p(x) and normalizing, 

$$p(\mathbf{x}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{x})\, p(\mathbf{x})}{p(\mathbf{y})}, \tag{4.33}$$

where p(y) = ∫_x p(y|x)p(x) is a normalizing constant used to make the p(x|y) distribution 
proper (integrate to 1). Taking the negative logarithm of both sides of (4.33), we get 

$$-\log p(\mathbf{x}|\mathbf{y}) = -\log p(\mathbf{y}|\mathbf{x}) - \log p(\mathbf{x}) + C, \tag{4.34}$$

which is the negative posterior log likelihood. 

To find the most likely (maximum a posteriori or MAP) solution x given some measurements 
y, we simply minimize this negative log likelihood, which can also be thought of as an 
energy, 

$$E(\mathbf{x}, \mathbf{y}) = E_D(\mathbf{x}, \mathbf{y}) + E_P(\mathbf{x}). \tag{4.35}$$

(We drop the constant C because its value does not matter during energy minimization.) 
The first term E_D(x, y) is the data energy or data penalty; it measures the negative log 
likelihood that the data were observed given the unknown state x. The second term E_P(x) is 
the prior energy; it plays a role analogous to the smoothness energy in regularization. Note 
that the MAP estimate may not always be desirable, as it selects the “peak” in the posterior 
distribution rather than some more stable statistic; see the discussion in Appendix B.2 and 
by Levin, Weiss et al. (2009). 
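
The equivalence between maximizing the posterior and minimizing the energy can be checked on a toy problem with a single two-state unknown. The state names and probability values below are made up for illustration.

```python
import math

prior = {"dark": 0.7, "bright": 0.3}       # p(x), assumed values
likelihood = {"dark": 0.2, "bright": 0.9}  # p(y | x) for one fixed observation y

# Bayes' rule (4.33): normalize the product of likelihood and prior.
evidence = sum(likelihood[x] * prior[x] for x in prior)  # p(y)
posterior = {x: likelihood[x] * prior[x] / evidence for x in prior}


def energy(x):
    """Data energy plus prior energy, i.e., (4.34) without the constant C."""
    return -math.log(likelihood[x]) - math.log(prior[x])


x_map = min(prior, key=energy)               # minimum-energy state
x_peak = max(posterior, key=posterior.get)   # posterior mode
```

Here the constant C in (4.34) equals log p(y), which is the same for every state and therefore does not affect which state has the lowest energy.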

For the remainder of this section, we focus on Markov random fields, which are proba- 
bilistic models defined over two or three-dimensional pixel or voxel grids. Before we dive 
into this, however, we should mention that MRFs are just one special case of the more general 
family of graphical models (Bishop 2006, Chapter 8; Koller and Friedman 2009; Nowozin 
and Lampert 2011; Murphy 2012, Chapters 10, 17, 19), which have sparse interactions be- 
tween variables that can be captured in a factor graph (Dellaert and Kaess 2017; Dellaert 
2021), such as the one shown in Figure 4.12. Graphical models come in a wide variety of 
topologies, including chains (used for audio and speech processing), trees (often used for 
modeling kinematic chains in tracking people (e.g., Felzenszwalb and Huttenlocher 2005)), 
stars (simplified models for people; Dalal and Triggs 2005; Felzenszwalb, Girshick et al. 
2010), and constellations (Fergus, Perona, and Zisserman 2007). Such models were widely 
used for part-based recognition, as discussed in Section 6.2.1. For graphs that are acyclic, 
efficient linear-time inference algorithms based on dynamic programming can be used. 


For image processing applications, the unknowns x are the set of output pixels 

$$\mathbf{x} = [f(0,0) \ldots f(m-1, n-1)], \tag{4.36}$$

and the data are (in the simplest case) the input pixels 

$$\mathbf{y} = [d(0,0) \ldots d(m-1, n-1)] \tag{4.37}$$

as shown in Figure 4.12. 

For a Markov random field, the probability p(x) is a Gibbs or Boltzmann distribution, 
whose negative log likelihood (according to the Hammersley–Clifford theorem) can be written 
as a sum of pairwise interaction potentials, 

$$E_P(\mathbf{x}) = \sum_{\{(i,j),(k,l)\} \in \mathcal{N}(i,j)} V_{i,j,k,l}(f(i,j), f(k,l)), \tag{4.38}$$

where N(i, j) denotes the neighbors of pixel (i, j). In fact, the general version of the theorem 
says that the energy may have to be evaluated over a larger set of cliques, which depend on 
the order of the Markov random field (Kindermann and Snell 1980; Geman and Geman 1984; 
Bishop 2006; Kohli, Ladický, and Torr 2009; Kohli, Kumar, and Torr 2009). 

Figure 4.12 Graphical model for an N4 neighborhood Markov random field. (The blue 
edges are added for an N8 neighborhood.) The white circles are the unknowns f(i, j), while 
the dark circles are the input data d(i, j). The s_x(i, j) and s_y(i, j) black boxes denote arbitrary 
interaction potentials between adjacent nodes in the random field, and the c(i, j) denote 
the data penalty functions. The same graphical model can be used to depict a discrete version 
of a first-order regularization problem (Figure 4.9). 


The most commonly used neighborhood in Markov random field modeling is the N4 
neighborhood, where each pixel in the field f(i, j) interacts only with its immediate neighbors. 
The model in Figure 4.12, which we previously used in Figure 4.9 to illustrate the 
discrete version of first-order regularization, shows an N4 MRF. The s_x(i, j) and s_y(i, j) 
black boxes denote arbitrary interaction potentials between adjacent nodes in the random 
field and the c(i, j) denote the data penalty functions. These square nodes can also be interpreted 
as factors in a factor graph version of the (undirected) graphical model (Bishop 
2006; Dellaert and Kaess 2017; Dellaert 2021), where “factor” is another name for an interaction 
potential. (Strictly speaking, the factors are (improper) probability functions whose product is the 
(un-normalized) posterior distribution.) 


As we will see in (4.41–4.42), there is a close relationship between these interaction 
potentials and the discretized versions of regularized image restoration problems. Thus, to 
a first approximation, we can view the energy minimization performed when solving a 
regularized problem and the maximum a posteriori inference performed in an MRF as 
equivalent. 

While N4 neighborhoods are most commonly used, in some applications N8 (or even 
higher order) neighborhoods perform better at tasks such as image segmentation because 
they can better model discontinuities at different orientations (Boykov and Kolmogorov 2003; 
Rother, Kohli et al. 2009; Kohli, Ladicky, and Torr 2009; Kohli, Kumar, and Torr 2009). 


Binary MRFs 


The simplest possible example of a Markov random field is a binary field. Examples of such 
fields include 1-bit (black and white) scanned document images as well as images segmented 
into foreground and background regions. 

To denoise a scanned image, we set the data penalty to reflect the agreement between the 
scanned and final images, 

$$E_D(i,j) = w\,\delta(f(i,j), d(i,j)) \tag{4.39}$$

and the smoothness penalty to reflect the agreement between neighboring pixels 

$$E_P(i,j) = s\,\delta(f(i,j), f(i+1,j)) + s\,\delta(f(i,j), f(i,j+1)). \tag{4.40}$$


Once we have formulated the energy, how do we minimize it? The simplest approach is 
to perform gradient descent, flipping one state at a time if it produces a lower energy. This ap- 
proach is known as contextual classification (Kittler and Föglein 1984), iterated conditional 
modes (ICM) (Besag 1986), or highest confidence first (HCF) (Chou and Brown 1990) if the 
pixel with the largest energy decrease is selected first. 
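
A minimal sketch of this kind of greedy update, in the style of ICM, is given below. It assumes that δ in (4.39) and (4.40) acts as a disagreement penalty (1 when the two labels differ, 0 when they agree), visits pixels in raster order, and flips a pixel whenever doing so lowers its local energy. The function names and the 5 × 5 test image are hypothetical.

```python
def energy(f, d, w, s):
    """Total energy: data disagreements plus N4 neighbor disagreements."""
    rows, cols = len(f), len(f[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            total += w * (f[i][j] != d[i][j])
            if i + 1 < rows:
                total += s * (f[i][j] != f[i + 1][j])
            if j + 1 < cols:
                total += s * (f[i][j] != f[i][j + 1])
    return total


def local_cost(f, d, i, j, label, w, s):
    """Energy terms that involve pixel (i, j) if it takes the given label."""
    rows, cols = len(f), len(f[0])
    cost = w * (label != d[i][j])
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < rows and 0 <= nj < cols:
            cost += s * (label != f[ni][nj])
    return cost


def icm(d, w, s, max_sweeps=10):
    """Flip one pixel at a time whenever the flip lowers the energy."""
    f = [row[:] for row in d]          # start from the observed image
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(f)):
            for j in range(len(f[0])):
                flipped = 1 - f[i][j]
                if (local_cost(f, d, i, j, flipped, w, s)
                        < local_cost(f, d, i, j, f[i][j], w, s)):
                    f[i][j] = flipped
                    changed = True
        if not changed:                # no flip helps: a local minimum
            break
    return f


noisy = [[1] * 5 for _ in range(5)]
noisy[2][2] = 0                        # one isolated noise pixel
restored = icm(noisy, w=1.0, s=1.0)
```

In this example, flipping the isolated pixel costs one unit of data energy but removes four units of smoothness energy, so the update is accepted and the sweep then terminates.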

Unfortunately, these downhill methods tend to get easily stuck in local minima. An al- 
ternative approach is to add some randomness to the process, which is known as stochas- 
tic gradient descent (Metropolis, Rosenbluth et al. 1953; Geman and Geman 1984). When 
the amount of noise is decreased over time, this technique is known as simulated annealing 
(Kirkpatrick, Gelatt, and Vecchi 1983; Carnevali, Coletti, and Patarnello 1985; Wolberg and 
Pavlidis 1985; Swendsen and Wang 1987) and was first popularized in computer vision by 
Geman and Geman (1984) and later applied to stereo matching by Barnard (1989), among 
others. 

Even this technique, however, does not perform that well (Boykov, Veksler, and Zabih 
2001). For binary images, a much better technique, introduced to the computer vision com- 
munity by Boykov, Veksler, and Zabih (2001) is to re-formulate the energy minimization as 
a max-flow/min-cut graph optimization problem (Greig, Porteous, and Seheult 1989). This 
technique has informally come to be known as graph cuts in the computer vision community 
(Boykov and Kolmogorov 2011). For simple energy functions, e.g., those where the penalty 
for non-identical neighboring pixels is a constant, this algorithm is guaranteed to produce the 
global minimum. Kolmogorov and Zabih (2004) formally characterize the class of binary 
energy potentials (regularity conditions) for which these results hold, while newer work by 
Komodakis, Tziritas, and Paragios (2008) and Rother, Kolmogorov et al. (2007) provide good 
algorithms for the cases when they do not, i.e., for energy functions that are not regular or 
sub-modular. 
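
The max-flow/min-cut reformulation can be illustrated on a tiny one-dimensional binary chain with a constant neighbor penalty. The sketch below builds the usual source/sink graph and finds the minimum cut with a plain Edmonds-Karp augmenting-path search, which is a textbook max-flow method chosen here for brevity and not the specialized algorithm used in the vision literature. It again assumes δ is a disagreement penalty; all names and the test signal are hypothetical.

```python
from collections import deque
from itertools import product


def energy(f, d, w, s):
    """Binary chain energy: data disagreements plus neighbor disagreements."""
    data = sum(w for fi, di in zip(f, d) if fi != di)
    smooth = sum(s for a, b in zip(f, f[1:]) if a != b)
    return data + smooth


def min_cut_labels(d, w, s):
    """Exact minimum-energy labels for a 1D binary chain via max-flow."""
    n = len(d)
    src, snk = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        cap[src][i] = w if d[i] != 1 else 0.0  # cut if pixel i gets label 1
        cap[i][snk] = w if d[i] != 0 else 0.0  # cut if pixel i gets label 0
        if i + 1 < n:
            cap[i][i + 1] = cap[i + 1][i] = s  # cut if neighbors disagree
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {src: None}
        queue = deque([src])
        while queue and snk not in parent:
            u = queue.popleft()
            for v in range(n + 2):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if snk not in parent:
            break                      # no path left: the flow is maximal
        flow, v = float("inf"), snk
        while parent[v] is not None:   # bottleneck capacity along the path
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = snk
        while parent[v] is not None:   # push flow, update residual capacities
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u
    # Pixels still reachable from the source lie on the label-0 side of the cut.
    return [0 if i in parent else 1 for i in range(n)]


def brute_force_min(d, w, s):
    """Reference answer: minimum energy over all 2^n labelings."""
    return min(energy(f, d, w, s) for f in product((0, 1), repeat=len(d)))


observed = [0, 0, 1, 0, 0, 1, 1, 1]
labels = min_cut_labels(observed, w=1.0, s=0.75)
```

Comparing `energy(labels, observed, 1.0, 0.75)` with `brute_force_min(observed, 1.0, 0.75)` is a simple way to confirm on small inputs that the cut attains the global minimum.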

In addition to the above mentioned techniques, a number of other optimization approaches 
have been developed for MRF energy minimization, such as (loopy) belief propagation and 
dynamic programming (for one-dimensional problems). These are discussed in more detail 
in Appendix B.5 as well as the comparative survey papers by Szeliski, Zabih et al. (2008) 
and Kappes, Andres et al. (2015), which have associated benchmarks and code at 
https://vision.middlebury.edu/MRF and http://hciweb2.iwr.uni-heidelberg.de/opengm. 


Ordinal-valued MRFs 


In addition to binary images, Markov random fields can be applied to ordinal-valued labels 
such as grayscale images or depth maps. The term “ordinal” indicates that the labels have an 
implied ordering, e.g., that higher values are lighter pixels. In the next section, we look at 
unordered labels, such as source image labels for image compositing. 


In many cases, it is common to extend the binary data and smoothness prior terms as 

$$E_D(i,j) = c(i,j)\,\rho_d(f(i,j) - d(i,j)) \tag{4.41}$$

and 

$$E_P(i,j) = s_x(i,j)\,\rho_p(f(i,j) - f(i+1,j)) + s_y(i,j)\,\rho_p(f(i,j) - f(i,j+1)), \tag{4.42}$$

which are robust generalizations of the quadratic penalty terms (4.26) and (4.24), first introduced 
in (4.29). As before, the c(i, j), s_x(i, j), and s_y(i, j) weights can be used to locally 
control the data weighting and the horizontal and vertical smoothness. Instead of using a 
quadratic penalty, however, a general monotonically increasing penalty function ρ() is used. 
(Different functions can be used for the data and smoothness terms.) For example, ρ_p can be 
a hyper-Laplacian penalty 

$$\rho_p(d) = |d|^p, \quad p < 1, \tag{4.43}$$

which better encodes the distribution of gradients (mainly edges) in an image than either a 
quadratic or linear (total variation) penalty. Levin and Weiss (2007) use such a penalty to 
separate a transmitted and reflected image (Figure 9.16) by encouraging gradients to lie in 
one or the other image, but not both. Levin, Fergus et al. (2007) use the hyper-Laplacian as a 
prior for image deconvolution (deblurring) and Krishnan and Fergus (2009) develop a faster 
algorithm for solving such problems. For the data penalty, ρ_d can be quadratic (to model 
Gaussian noise) or the log of a contaminated Gaussian (Appendix B.3). 

Note that, unlike a quadratic penalty, the sum of the horizontal and vertical derivative p-norms is not rotationally 
invariant. A better approach may be to locally estimate the gradient direction and to impose different norms on the 
perpendicular and parallel components, which Roth and Black (2007b) call a steerable random field. 

Figure 4.13 Grayscale image denoising and inpainting: (a) original image; (b) image 
corrupted by noise and with missing data (black bar); (c) image restored using loopy belief 
propagation; (d) image restored using expansion move graph cuts. Images are from 
https://vision.middlebury.edu/MRF/results (Szeliski, Zabih et al. 2008). 

When ρ_p is a quadratic function, the resulting Markov random field is called a Gaussian 
Markov random field (GMRF) and its minimum can be found by sparse linear system solving 
(4.28). When the weighting functions are uniform, the GMRF becomes a special case of 
Wiener filtering (Section 3.4.1). Allowing the weighting functions to depend on the input 
image (a special kind of conditional random field, which we describe below) enables quite 
sophisticated image processing algorithms to be performed, including colorization (Levin, 
Lischinski, and Weiss 2004), interactive tone mapping (Lischinski, Farbman et al. 2006), 
natural image matting (Levin, Lischinski, and Weiss 2008), and image restoration (Tappen, 
Liu et al. 2007). 
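
As a minimal sketch of the sparse linear solve, consider a 1-D GMRF with energy $\sum_i w_i (f_i - d_i)^2 + \sum_i s_i (f_{i+1} - f_i)^2$. This 1-D energy, the function name, and the argument names are assumptions made for illustration. Setting the gradient to zero gives a tridiagonal system, which the code solves with the standard Thomas algorithm; a 2-D problem would instead use a general sparse solver.

```python
def solve_gmrf_1d(data, data_weight, smoothness):
    """Minimize sum_i w_i (f_i - d_i)^2 + sum_i s_i (f_{i+1} - f_i)^2.

    smoothness[i] links samples i and i+1, so it has len(data) - 1 entries.
    Assumes at least one positive data weight so the system is non-singular.
    """
    n = len(data)
    # Normal equations: (w_i + s_{i-1} + s_i) f_i - s_{i-1} f_{i-1} - s_i f_{i+1} = w_i d_i
    diag = [float(w) for w in data_weight]
    for i, s in enumerate(smoothness):
        diag[i] += s
        diag[i + 1] += s
    off = [-float(s) for s in smoothness]   # symmetric sub- and super-diagonal
    rhs = [w * d for w, d in zip(data_weight, data)]

    # Forward elimination (Thomas algorithm).
    for i in range(1, n):
        m = off[i - 1] / diag[i - 1]
        diag[i] -= m * off[i - 1]
        rhs[i] -= m * rhs[i - 1]

    # Back substitution.
    f = [0.0] * n
    f[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        f[i] = (rhs[i] - off[i] * f[i + 1]) / diag[i]
    return f


# A single spike is spread out by uniform smoothness weights.
smoothed = solve_gmrf_1d([0, 0, 10, 0, 0], [1, 1, 1, 1, 1], [1, 1, 1, 1])
```

Making `smoothness` depend on the input signal (for example, lowering it across strong edges) turns the same solver into the data-dependent variant described above.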

When $\rho_d$ or $\rho_p$ are non-quadratic functions, gradient descent techniques such as non-
linear least squares or iteratively re-weighted least squares can sometimes be used (Ap-
pendix A.3). However, if the search space has lots of local minima, as is the case for stereo
matching (Barnard 1989; Boykov, Veksler, and Zabih 2001), more sophisticated techniques 
are required. 
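
The following sketch illustrates iteratively re-weighted least squares on the simplest possible problem, estimating a single scalar $x$ that minimizes $\sum_i |x - d_i|^p$. The scalar setting, the function name, and the clamping constant `eps` are assumptions chosen to keep the example short; in an MRF the same re-weighting is applied to every data and smoothness term before each linear solve.

```python
def irls_location(data, p=1.0, iterations=100, eps=1e-8):
    """Minimize sum_i |x - d_i|^p over a scalar x by re-weighted least squares."""
    x = sum(data) / len(data)   # start from the ordinary least squares answer
    for _ in range(iterations):
        # The weight |r|^(p-2) makes the quadratic w * r^2 match |r|^p at the
        # current residual r; eps guards against division by zero.
        weights = [max(abs(x - d), eps) ** (p - 2.0) for d in data]
        # Each iteration is a weighted least squares problem with a closed form.
        x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
    return x


samples = [1, 2, 3, 4, 100]                     # one gross outlier
quadratic_fit = irls_location(samples, p=2.0)   # the mean, pulled toward 100
robust_fit = irls_location(samples, p=1.0)      # approaches the median
```

For $p < 1$ the energy is no longer convex, so the result depends on the starting point, which is the local minimum problem mentioned above.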

The extension of graph cut techniques to multi-valued problems was first proposed by 
Boykov, Veksler, and Zabih (2001). In their paper, they develop two different algorithms, 


called the swap move and the expansion move, which iterate among a series of binary labeling

(a) initial labeling (b) standard move (c) α-β-swap (d) α-expansion

Figure 4.14 Multi-level graph optimization from Boykov, Veksler, and Zabih (2001) © 2001
IEEE: (a) initial problem configuration; (b) the standard move only changes one pixel; (c)
the α-β-swap optimally exchanges all α- and β-labeled pixels; (d) the α-expansion move
optimally selects among current pixel values and the α label.


sub-problems to find a good solution (Figure 4.14). Note that a global solution is generally not 
achievable, as the problem is provably NP-hard for general energy functions. Because both 
these algorithms use a binary MRF optimization inside their inner loop, they are subject to the 
kind of constraints on the energy functions that occur in the binary labeling case (Kolmogorov 
and Zabih 2004). 
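
The structure of the expansion move can be sketched on a 1-D chain of pixels with a Potts smoothness term. This is a simplification and not the algorithm of Boykov, Veksler, and Zabih (2001): on a chain, the binary "keep the current label or switch to α" sub-problem can be solved exactly by dynamic programming, which stands in here for the graph cut needed on a 2-D grid. All function and variable names are hypothetical.

```python
def energy(labels, data_cost, smooth_weight):
    """Potts energy on a chain: unary costs plus a penalty per label change."""
    unary = sum(data_cost[i][l] for i, l in enumerate(labels))
    pairwise = sum(smooth_weight for a, b in zip(labels, labels[1:]) if a != b)
    return unary + pairwise


def expansion_move(labels, alpha, data_cost, smooth_weight):
    """Best labeling in which each pixel keeps its label or switches to alpha."""
    n = len(labels)
    cand = [(labels[i], alpha) for i in range(n)]   # choice 0 = keep, 1 = alpha
    cost = [data_cost[0][c] for c in cand[0]]
    back = []
    for i in range(1, n):
        new_cost, choice = [], []
        for c in cand[i]:
            options = [cost[k] + (smooth_weight if cand[i - 1][k] != c else 0.0)
                       for k in (0, 1)]
            k_best = 0 if options[0] <= options[1] else 1
            new_cost.append(options[k_best] + data_cost[i][c])
            choice.append(k_best)
        cost = new_cost
        back.append(choice)
    # Trace the optimal binary choices back from the last pixel.
    k = 0 if cost[0] <= cost[1] else 1
    result = [None] * n
    for i in range(n - 1, -1, -1):
        result[i] = cand[i][k]
        if i > 0:
            k = back[i - 1][k]
    return result


def alpha_expansion(data_cost, smooth_weight, num_labels, max_sweeps=10):
    """Cycle over labels, accepting each expansion move that lowers the energy."""
    labels = [min(range(num_labels), key=lambda l: c[l]) for c in data_cost]
    best = energy(labels, data_cost, smooth_weight)
    for _ in range(max_sweeps):
        improved = False
        for alpha in range(num_labels):
            proposal = expansion_move(labels, alpha, data_cost, smooth_weight)
            e = energy(proposal, data_cost, smooth_weight)
            if e < best:
                labels, best, improved = proposal, e, True
        if not improved:
            break
    return labels, best


# Noisy observations of a two-region signal; data cost is |label - observation|.
observed = [0, 0, 2, 0, 0, 1, 1, 1]
costs = [[abs(l - o) for l in range(3)] for o in observed]
final_labels, final_energy = alpha_expansion(costs, smooth_weight=1.5, num_labels=3)
```

In this toy example the isolated label 2 is absorbed by the first expansion on label 0, because paying its data cost is cheaper than paying for two label changes.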

Another MRF inference technique is belief propagation (BP). While belief propagation 
was originally developed for inference over trees, where it is exact (Pearl 1988), it has more 
recently been applied to graphs with loops such as Markov random fields (Freeman, Pasz- 
tor, and Carmichael 2000; Yedidia, Freeman, and Weiss 2001). In fact, some of the better 
performing stereo-matching algorithms use loopy belief propagation (LBP) to perform their 
inference (Sun, Zheng, and Shum 2003). LBP is discussed in more detail in comparative sur-
vey papers on MRF optimization (Szeliski, Zabih et al. 2008; Kappes, Andres et al. 2015).


Figure 4.13 shows an example of image denoising and inpainting (hole filling) using a 
non-quadratic energy function (non-Gaussian MRF). The original image has been corrupted 
by noise and a portion of the data has been removed (the black bar). In this case, the loopy 
belief propagation algorithm computes a slightly lower energy and also a smoother image 
than the alpha-expansion graph cut algorithm. 

Of course, the above formula (4.42) for the smoothness term $E_p(i, j)$ just shows the
simplest case. In follow-on work, Roth and Black (2009) propose a Field of Experts (FoE) 
model, which sums up a large number of exponentiated local filter outputs to arrive at the 
smoothness penalty. Weiss and Freeman (2007) analyze this approach and compare it to the 


simpler hyper-Laplacian model of natural image statistics. Lyu and Simoncelli (2009) use

Figure 4.15 Graphical model for a Markov random field with a more complex measurement
model. The additional colored edges show how combinations of unknown values (say,
in a sharp image) produce the measured values (a noisy blurred image). The resulting graphical
model is still a classic MRF and is just as easy to sample from, but some inference
algorithms (e.g., those based on graph cuts) may not be applicable because of the increased
network complexity, since state changes during the inference become more entangled and the
posterior MRF has much larger cliques.


Gaussian Scale Mixtures (GSMs) to construct an inhomogeneous multi-scale MRF, with one 


(positive exponential) GMRF modulating the variance (amplitude) of another Gaussian MRF. 


It is also possible to extend the measurement model to make the sampled (noise-corrupted) 
input pixels correspond to blends of unknown (latent) image pixels, as in Figure 4.15. This is 
the commonly occurring case when trying to deblur an image. While this kind of a model is 
still a traditional generative Markov random field, i.e., we can in principle generate random 
samples from the prior distribution, finding an optimal solution can be difficult because the 
clique sizes get larger. In such situations, gradient descent techniques, such as iteratively 
reweighted least squares, can be used (Joshi, Zitnick et al. 2009). Exercise 4.4 has you 
explore some of these issues. 


Unordered labels 


Another case with multi-valued labels where Markov random fields are often applied is that of 
unordered labels, i.e., labels where there is no semantic meaning to the numerical difference
between the values of two labels. For example, if we are classifying terrain from aerial 
imagery, it makes no sense to take the numerical difference between the labels assigned to 


forest, field, water, and pavement. In fact, the adjacencies of these various kinds of terrain

Figure 4.16 An unordered label MRF (Agarwala, Dontcheva et al. 2004) © 2004 ACM:
Strokes in each of the source images on the left are used as constraints on an MRF optimization,
which is solved using graph cuts. The resulting multi-valued label field is shown as a
color overlay in the middle image, and the final composite is shown on the right.


each have different likelihoods, so it makes more sense to use a prior of the form

$$E_p(i, j) = s_x(i, j)\, V(l(i, j), l(i+1, j)) + s_y(i, j)\, V(l(i, j), l(i, j+1)), \qquad (4.44)$$

where $V(l_0, l_1)$ is a general compatibility or potential function. (Note that we have also
replaced $f(i, j)$ with $l(i, j)$ to make it clearer that these are labels rather than function samples.)
An alternative way to write this prior energy (Boykov, Veksler, and Zabih 2001; Szeliski,
Zabih et al. 2008) is

$$E_p = \sum_{(p,q) \in \mathcal{N}} V_{p,q}(l_p, l_q), \qquad (4.45)$$

where the $(p, q)$ are neighboring pixels and a spatially varying potential function $V_{p,q}$ is
evaluated for each neighboring pair.

An important application of unordered MRF labeling is seam finding in image composit- 
ing (Davis 1998; Agarwala, Dontcheva et al. 2004) (see Figure 4.16, which is explained in 
more detail in Section 8.4.2). Here, the compatibility $V_{p,q}(l_p, l_q)$ measures the quality of the
visual appearance that would result from placing a pixel $p$ from image $l_p$ next to a pixel $q$
from image $l_q$. As with most MRFs, we assume that $V_{p,q}(l, l) = 0$. For different labels,
however, the compatibility $V_{p,q}(l_p, l_q)$ may depend on the values of the underlying pixels $I_{l_p}(p)$
and $I_{l_q}(q)$.

Consider, for example, where one image $I_0$ is all sky blue, i.e., $I_0(p) = I_0(q) = B$, while
the other image $I_1$ has a transition from sky blue, $I_1(p) = B$, to forest green, $I_1(q) = G$.
In this case, $V_{p,q}(1, 0) = 0$ (the colors agree), while $V_{p,q}(0, 1) > 0$ (the colors disagree).

4.3.1 Conditional random fields


In a classic Bayesian model (4.33–4.35),


$$p(x|y) \propto p(y|x)\, p(x), \qquad (4.46)$$


the prior distribution p(x) is independent of the observations y. Sometimes, however, it is 
useful to modify our prior assumptions, say about the smoothness of the field we are trying 
to estimate, in response to the sensed data. Whether this makes sense from a probability 
viewpoint is something we discuss once we have explained the new model. 

Consider an interactive image segmentation system such as the one described in Boykov 
and Funka-Lea (2006). In this application, the user draws foreground and background strokes, 
and the system then solves a binary MRF labeling problem to estimate the extent of the 
foreground object. In addition to minimizing a data term, which measures the pointwise 
similarity between pixel colors and the inferred region distributions (Section 4.3.2), the MRF 
is modified so that the smoothness terms $s_x(x, y)$ and $s_y(x, y)$ in Figure 4.12 and (4.42)
depend on the magnitude of the gradient between adjacent pixels.⁷
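
As an illustration of data-dependent smoothness, the sketch below computes horizontal and vertical weights that decay with the squared intensity difference between neighboring pixels. The exponential form and the parameters `beta` and `strength` are assumptions (a commonly used choice in the interactive segmentation literature), not a formula given in this section; the image is assumed to be a grayscale list of rows.

```python
import math


def smoothness_weights(image, beta=0.1, strength=1.0):
    """Return (s_x, s_y), weights between horizontal and vertical neighbors.

    s_x[r][c] links pixel (r, c) to (r, c+1); s_y[r][c] links (r, c) to (r+1, c).
    A large intensity difference gives a small weight, so label
    discontinuities are cheap where the image has a strong edge.
    """
    rows, cols = len(image), len(image[0])
    s_x = [[strength * math.exp(-beta * (image[r][c + 1] - image[r][c]) ** 2)
            for c in range(cols - 1)] for r in range(rows)]
    s_y = [[strength * math.exp(-beta * (image[r + 1][c] - image[r][c]) ** 2)
            for c in range(cols)] for r in range(rows - 1)]
    return s_x, s_y


# A tiny image with a vertical edge between the second and third columns.
tiny = [[10, 10, 50],
        [10, 10, 50]]
s_x, s_y = smoothness_weights(tiny)
```

Here the weight across the flat region stays at full strength, while the weight across the edge is close to zero.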

Since the smoothness term now depends on the data, Bayes’ rule (4.46) no longer applies.
Instead, we use a direct model for the posterior distribution $p(x|y)$, whose negative log
likelihood can be written as

$$E(x|y) = E_d(x, y) + E_s(x, y) = \sum_p V_p(x_p, y) + \sum_{(p,q) \in \mathcal{N}} V_{p,q}(x_p, x_q, y), \qquad (4.47)$$

using the notation introduced in (4.45). The resulting probability distribution is called a
conditional random field (CRF) and was first introduced to the computer vision field by Kumar
and Hebert (2003), based on earlier work in text modeling by Lafferty, McCallum, and Pereira
(2001).

Figure 4.17 shows a graphical model where the smoothness terms depend on the data 
values. In this particular model, each smoothness term depends only on its adjacent pair of 
data values, i.e., terms are of the form $V_{p,q}(x_p, x_q, y_p, y_q)$ in (4.47).

The idea of modifying smoothness terms in response to input data is not new. For exam- 
ple, Boykov and Jolly (2001) used this idea for interactive segmentation, and it is now widely 
used in image segmentation (Section 4.3.2) (Blake, Rother et al. 2004; Rother, Kolmogorov, 
and Blake 2004), denoising (Tappen, Liu et al. 2007), and object recognition (Section 6.4) 
(Winn and Shotton 2006; Shotton, Winn et al. 2009). 


7 An alternative formulation that also uses detected edges to modulate the smoothness of a depth or motion field
and hence to integrate multiple lower level vision modules is presented by Poggio, Gamble, and Little (1988).

Figure 4.17 Graphical model for a conditional random field (CRF). The additional green
edges show how combinations of sensed data influence the smoothness in the underlying MRF
prior model, i.e., $s_x(i, j)$ and $s_y(i, j)$ in (4.42) depend on adjacent $d(i, j)$ values. These
additional links (factors) enable the smoothness to depend on the input data. However, they
make sampling from this MRF more complex.


In stereo matching, the idea of encouraging disparity discontinuities to coincide with 
intensity edges goes back even further to the early days of optimization and MRF-based 
algorithms (Poggio, Gamble, and Little 1988; Fua 1993; Bobick and Intille 1999; Boykov, 
Veksler, and Zabih 2001) and is discussed in more detail in Section 12.5.

In addition to using smoothness terms that adapt to the input data, Kumar and Hebert 
(2003) also compute a neighborhood function over the input data for each $V_p(x_p, y)$ term,
as illustrated in Figure 4.18, instead of using the classic unary MRF data term $V_p(x_p, y_p)$
shown in Figure 4.12.⁸ Because such neighborhood functions can be thought of as dis-
criminant functions (a term widely used in machine learning (Bishop 2006)), they call the
resulting graphical model a discriminative random field (DRF). In their paper, Kumar and 
Hebert (2006) show that DRFs outperform similar CRFs on a number of applications, such 
as structure detection and binary image denoising. 

Here again, one could argue that previous stereo correspondence algorithms also look at 
a neighborhood of input data, either explicitly, because they compute correlation measures 
(Criminisi, Cross et al. 2006) as data terms, or implicitly, because even pixel-wise disparity 
costs look at several pixels in either the left or right image (Barnard 1989; Boykov, Veksler, 
and Zabih 2001). 


What then are the advantages and disadvantages of using conditional or discriminative 


8 Kumar and Hebert (2006) call the unary potentials $V_p(x_p, y)$ association potentials and the pairwise potentials
$V_{p,q}(x_p, x_q, y)$ interaction potentials.

Figure 4.18 Graphical model for a discriminative random field (DRF). The additional
green edges show how combinations of sensed data, e.g., $d(i, j+1)$, influence the data term
for $f(i, j)$. The generative model is therefore more complex, i.e., we cannot just apply a
simple function to the unknown variables and add noise.


random fields instead of MRFs? 


Classic Bayesian inference (MRF) assumes that the prior distribution of the data is in- 
dependent of the measurements. This makes a lot of sense: if you see a pair of sixes when 
you first throw a pair of dice, it would be unwise to assume that they will always show up 
thereafter. However, if after playing for a long time you detect a statistically significant bias, 
you may want to adjust your prior. What CRFs do, in essence, is to select or modify the prior 
model based on observed data. This can be viewed as making a partial inference over addi- 
tional hidden variables or correlations between the unknowns (say, a label, depth, or clean 


image) and the knowns (observed images). 


In some cases, the CRF approach makes a lot of sense and is, in fact, the only plausible 
way to proceed. For example, in grayscale image colorization (Section 4.2.4) (Levin, Lischin- 
ski, and Weiss 2004), a commonly used way to transfer the continuity information from the 
input grayscale image to the unknown color image is to modify the local smoothness con- 
straints. Similarly, for simultaneous segmentation and recognition (Winn and Shotton 2006; 
Shotton, Winn et al. 2009), it makes a lot of sense to permit strong color edges to increase 


the likelihood of semantic image label discontinuities. 


In other cases, such as image denoising, the situation is more subtle. Using a non- 
quadratic (robust) smoothness term as in (4.42) plays a qualitatively similar role to setting 
the smoothness based on local gradient information in a Gaussian MRF (GMRF) (Tappen, 
Liu et al. 2007; Tanaka and Okutomi 2008). The advantage of Gaussian MRFs, when the 


smoothness can be correctly inferred, is that the resulting quadratic energy can be minimized

(a) Image (b) Unary classifiers (c) Robust $P^n$ CRF (d) Fully connected CRF, MCMC inference, 36 hrs (e) Fully connected CRF, our approach, 0.2 seconds

Figure 4.19 Pixel-level classification with a fully connected CRF, from © Krähenbühl
and Koltun (2011). The labels in each column describe the image or algorithm being run,
which include a robust $P^n$ CRF (Kohli, Ladicky, and Torr 2009) and a very slow MCMC
optimization algorithm.


in a single step, i.e., by solving a sparse set of linear equations. However, for situations where
the discontinuities are not self-evident in the input data, such as for piecewise-smooth sparse 
data interpolation (Blake and Zisserman 1987; Terzopoulos 1988), classic robust smoothness 
energy minimization may be preferable. Thus, as with most computer vision algorithms, a 
careful analysis of the problem at hand and desired robustness and computation constraints 
may be required to choose the best technique. 

Perhaps the biggest advantage of CRFs and DRFs, as argued by Kumar and Hebert (2006), 
Tappen, Liu et al. (2007), and Blake, Rother et al. (2004), is that learning the model param- 
eters is more principled and sometimes easier. While learning parameters in MRFs and their 
variants is not a topic that we cover in this book, interested readers can find more details in 
publications by Kumar and Hebert (2006), Roth and Black (2007a), Tappen, Liu et al. (2007), 
Tappen (2007), and Li and Huttenlocher (2008). 


Dense Conditional Random Fields (CRFs) 


As with regular Markov random fields, conditional random fields (CRFs) are normally defined
over small neighborhoods, e.g., the $\mathcal{N}_4$ neighborhood shown in Figure 4.17. However,
images often contain longer-range interactions, e.g., pixels of similar colors may belong to
related classes (Figure 4.19). In order to model such longer-range interactions, Krähenbühl
and Koltun (2011) introduced what they call a fully connected CRF, which many people now
call a dense CRF.

As with traditional conditional random fields (4.47), their energy function consists of both 


unary terms and pairwise terms 


$$E(x|y) = \sum_p V_p(x_p, y) + \sum_{(p,q)} V_{p,q}(x_p, x_q, y_p, y_q), \qquad (4.48)$$

where the $(p, q)$ summation is now taken over all pairs of pixels, and not just adjacent ones.⁹


The y denotes the input (guide) image over which the random field is conditioned. The 
pairwise interaction potentials have a restricted form 


$$V_{p,q}(x_p, x_q, y_p, y_q) = \mu(x_p, x_q) \sum_{m=1}^{M} s_m w_m(p, q) \qquad (4.49)$$

that is the product of a spatially invariant label compatibility function $\mu(x_p, x_q)$ and a sum of
$M$ Gaussian kernels of the same form (3.37) as is used in bilateral filtering and the bilateral
solver. In their seminal paper, Krähenbühl and Koltun (2011) use two kernels, the first of
which is an appearance kernel similar to (3.37) and the second is a spatial-only smoothness
kernel.
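
A minimal sketch of the pairwise potential (4.49) with $M = 2$ kernels is shown below. It assumes a Potts label compatibility ($\mu = 1$ when the labels differ, 0 otherwise), and the kernel weights and bandwidths (`w_appearance`, `w_smooth`, `theta_alpha`, `theta_beta`, `theta_gamma`) are hypothetical values chosen only for illustration.

```python
import math


def dense_crf_pairwise(pos_p, pos_q, color_p, color_q, label_p, label_q,
                       w_appearance=1.0, w_smooth=1.0,
                       theta_alpha=10.0, theta_beta=20.0, theta_gamma=3.0):
    """Pairwise potential between two pixels, which need not be adjacent."""
    if label_p == label_q:
        return 0.0   # Potts compatibility: no penalty for equal labels
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_p, pos_q))
    d_col = sum((a - b) ** 2 for a, b in zip(color_p, color_q))
    # Appearance kernel: nearby pixels with similar colors interact strongly.
    appearance = math.exp(-d_pos / (2 * theta_alpha ** 2)
                          - d_col / (2 * theta_beta ** 2))
    # Smoothness kernel: depends on spatial distance only.
    smooth = math.exp(-d_pos / (2 * theta_gamma ** 2))
    return w_appearance * appearance + w_smooth * smooth


gray = (100, 100, 100)
near_similar = dense_crf_pairwise((0, 0), (1, 0), gray, gray, 0, 1)
far_different = dense_crf_pairwise((0, 0), (50, 0), gray, (0, 0, 0), 0, 1)
```

Assigning different labels to two nearby, similarly colored pixels is expensive, while distant pixels of different colors barely interact. Evaluating this potential for all pixel pairs directly is quadratic in the number of pixels, which is why the fast filtering methods discussed next matter.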

Because of the special form of the long-range interaction potentials, which encapsulate 
all spatial and color similarity terms into a bilateral form, higher-dimensional filtering al- 
gorithms similar to those used in fast bilateral filters and solvers (Adams, Baek, and Davis 
2010) can be used to efficiently compute a mean field approximation to the posterior condi-
tional distribution (Krähenbühl and Koltun 2011). Figure 4.19 shows a comparison of their
results (rightmost column) with previous approaches, including using simple unary terms, a 
robust CRF (Kohli, Ladicky, and Torr 2009), and a very slow MCMC (Markov chain Monte 
Carlo) inference algorithm. As you can see, the fully connected CRF with a mean field solver 
produces dramatically better results in a very short time. 

Since the publication of this paper, provably convergent and more efficient inference al-
gorithms have been developed both by the original authors (Krähenbühl and Koltun 2013)
and others (Vineet, Warrell, and Torr 2014; Desmaison, Bunel et al. 2016). Dense CRFs have 
seen widespread use in image segmentation problems and also as a “clean-up” stage for deep 


neural networks, as in the widely cited DeepLab paper by Chen, Papandreou et al. (2018). 


9 In practice, as with bilateral filtering and the bilateral solver, the spatial extent may be over a large but finite
region.

4.3.2 Application: Interactive segmentation


The goal of image segmentation algorithms is to group pixels that have similar appearance 
(statistics) and to have the boundaries between pixels in different regions be of short length 
and across visible discontinuities. If we restrict the boundary measurements to be between 
immediate neighbors and compute region membership statistics by summing over pixels, we 
can formulate this as a classic pixel-based energy function using either a variational formu- 
lation (Section 4.2) or as a binary Markov random field (Section 4.3). 

Examples of the continuous approach include Mumford and Shah (1989), Chan and Vese 
(2001), Zhu and Yuille (1996), and Tabb and Ahuja (1997) along with the level set approaches 
discussed in Section 7.3.2. An early example of a discrete labeling problem that combines 
both region-based and boundary-based energy terms is the work of Leclerc (1989), who used 
minimum description length (MDL) coding to derive the energy function being minimized. 
Boykov and Funka-Lea (2006) present a wonderful survey of various energy-based tech- 
niques for binary object segmentation, some of which we discuss below. 

As we saw earlier in this chapter, the energy corresponding to a segmentation problem 
can be written (cf. Equations (4.24) and (4.35–4.42)) as

$$E(f) = \sum_{i,j} E_r(i, j) + E_b(i, j), \qquad (4.50)$$

where the region term

$$E_r(i, j) = C(I(i, j); R(f(i, j))) \qquad (4.51)$$

is the negative log likelihood that pixel intensity (or color) $I(i, j)$ is consistent with the
statistics of region $R(f(i, j))$ and the boundary term

$$E_b(i, j) = s_x(i, j)\, \delta(f(i, j), f(i+1, j)) + s_y(i, j)\, \delta(f(i, j), f(i, j+1)) \qquad (4.52)$$

measures the inconsistency between $\mathcal{N}_4$ neighbors modulated by local horizontal and vertical
smoothness terms $s_x(i, j)$ and $s_y(i, j)$.

Region statistics can be something as simple as the mean gray level or color (Leclerc
1989), in which case

$$C(I; \mu_k) = \|I - \mu_k\|^2. \qquad (4.53)$$


Alternatively, they can be more complex, such as region intensity histograms (Boykov and 
Jolly 2001) or color Gaussian mixture models (Rother, Kolmogorov, and Blake 2004). For 
smoothness (boundary) terms, it is common to make the strength of the smoothness $s_x(i, j)$
inversely proportional to the local edge strength (Boykov, Veksler, and Zabih 2001). 
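
The energy (4.50) can be evaluated directly for a candidate labeling, as in the sketch below. It assumes a grayscale image stored as a list of rows, region statistics given by per-label mean intensities as in (4.53), and a boundary term that charges the local smoothness weight whenever two neighbors receive different labels (our reading of "inconsistency" in (4.52)). The function and argument names are hypothetical.

```python
def segmentation_energy(image, labels, means, s_x, s_y):
    """Sum of region terms and boundary terms over all pixels.

    s_x[r][c] weights the pair (r, c), (r, c+1);
    s_y[r][c] weights the pair (r, c), (r+1, c).
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            # Region term: squared distance to the mean of the assigned region.
            total += (image[r][c] - means[labels[r][c]]) ** 2
            # Boundary terms: pay the local weight if the labels differ.
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                total += s_x[r][c]
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                total += s_y[r][c]
    return total


row_image = [[0, 0, 10, 10]]
region_means = [0.0, 10.0]          # mean intensity of label 0 and label 1
weights_x, weights_y = [[2.0, 2.0, 2.0]], []
good = segmentation_energy(row_image, [[0, 0, 1, 1]], region_means, weights_x, weights_y)
poor = segmentation_energy(row_image, [[0, 0, 0, 0]], region_means, weights_x, weights_y)
```

The labeling that follows the intensity step pays for a single boundary, while the single-region labeling pays a large region cost for the two bright pixels.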
Originally, energy-based segmentation problems were optimized using iterative gradient 


descent techniques, which were slow and prone to getting trapped in local minima. Boykov

Figure 4.20 GrabCut image segmentation (Rother, Kolmogorov, and Blake 2004) © 2004
ACM: (a) the user draws a bounding box in red; (b) the algorithm guesses color distributions
for the object and background and performs a binary segmentation; (c) the process is
repeated with better region statistics.


and Jolly (2001) were the first to apply the binary MRF optimization algorithm developed by 
Greig, Porteous, and Seheult (1989) to binary object segmentation. 


In this approach, the user first delineates pixels in the background and foreground regions 
using a few strokes of an image brush. These pixels then become the seeds that tie nodes in 
the S-T graph to the source and sink labels S and T. Seed pixels can also be used to estimate
foreground and background region statistics (intensity or color histograms). 


The capacities of the other edges in the graph are derived from the region and boundary 
energy terms, i.e., pixels that are more compatible with the foreground or background region 
get stronger connections to the respective source or sink; adjacent pixels with greater smooth- 
ness also get stronger links. Once the minimum-cut/maximum-flow problem has been solved 
using a polynomial time algorithm (Goldberg and Tarjan 1988; Boykov and Kolmogorov 
2004), pixels on either side of the computed cut are labeled according to the source or sink to 
which they remain connected. While graph cuts is just one of several known techniques for 
MRF energy minimization, it is still the one most commonly used for solving binary MRF 
problems. 
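
The graph construction described above can be sketched for a 1-D row of pixels. This example uses a simple breadth-first augmenting path (Edmonds-Karp) max-flow on a dense capacity matrix, which is an assumption made for brevity and not the specialized algorithm of Boykov and Kolmogorov (2004). The names `fg_cost`, `bg_cost`, and `smoothness` are hypothetical.

```python
from collections import deque


def min_cut_segmentation(fg_cost, bg_cost, smoothness):
    """Label a row of pixels as foreground (1) or background (0) by min-cut.

    fg_cost[i] / bg_cost[i]: penalty for labeling pixel i foreground / background.
    smoothness[i]: penalty if pixels i and i+1 receive different labels.
    """
    n = len(fg_cost)
    S, T = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        cap[S][i] = bg_cost[i]   # cut when pixel i ends up on the sink side
        cap[i][T] = fg_cost[i]   # cut when pixel i stays on the source side
    for i, s in enumerate(smoothness):
        cap[i][i + 1] = cap[i + 1][i] = s

    def search():
        """Breadth-first search from S through edges with residual capacity."""
        parent = [-1] * (n + 2)
        parent[S] = S
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0.0
    while True:
        parent = search()
        if parent[T] == -1:
            break
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), T
        while v != S:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = T
        while v != S:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Pixels still reachable from the source are labeled foreground.
    reachable = search()
    return [1 if reachable[i] != -1 else 0 for i in range(n)], flow


seg_labels, cut_cost = min_cut_segmentation(
    fg_cost=[0, 1, 4, 5], bg_cost=[5, 4, 1, 0], smoothness=[2, 2, 2])
```

A pixel that fits the foreground model has a high `bg_cost` and therefore a strong link to the source. The value of the maximum flow equals the energy of the returned labeling, which is the minimum over all binary labelings.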

The basic binary segmentation algorithm of Boykov and Jolly (2001) has been extended 
in a number of directions. The GrabCut system of Rother, Kolmogorov, and Blake (2004) 
iteratively re-estimates the region statistics, which are modeled as a mixture of Gaussians in
color space. This allows their system to operate given minimal user input, such as a single
bounding box (Figure 4.20a); the background color model is initialized from a strip of pixels
around the box outline. (The foreground color model is initialized from the interior pixels, 
but quickly converges to a better estimate of the object.) The user can also place additional 
strokes to refine the segmentation as the solution progresses. Cui, Yang et al. (2008) use color 
and edge models derived from previous segmentations of similar objects to improve the local 


models used in GrabCut. Graph cut algorithms and other variants of Markov and conditional

(a) directed graph (b) image (c) undir. result (d) dir. result

Figure 4.21 Segmentation with a directed graph cut (Boykov and Funka-Lea 2006) © 2006
Springer: (a) directed graph; (b) image with seed points; (c) the undirected graph incorrectly
continues the boundary along the bright object; (d) the directed graph correctly segments the
light gray region from its darker surround.


random fields have been applied to the semantic segmentation problem (Shotton, Winn et al. 
2009; Krähenbühl and Koltun 2011), an example of which is shown in Figure 4.19 and which
we study in more detail in Section 6.4. 


Another major extension to the original binary segmentation formulation is the addition of 
directed edges, which allows boundary regions to be oriented, e.g., to prefer light to dark tran- 
sitions or vice versa (Kolmogorov and Boykov 2005). Figure 4.21 shows an example where 
the directed graph cut correctly segments the light gray liver from its dark gray surround. The 
same approach can be used to measure the flux exiting a region, i.e., the signed gradient pro- 
jected normal to the region boundary. Combining oriented graphs with larger neighborhoods 
enables approximating continuous problems such as those traditionally solved using level sets 
in the globally optimal graph cut framework (Boykov and Kolmogorov 2003; Kolmogorov 
and Boykov 2005). 

More recent developments in graph cut-based segmentation techniques include the ad- 
dition of connectivity priors to force the foreground to be in a single piece (Vicente, Kol- 
mogorov, and Rother 2008) and shape priors to use knowledge about an object’s shape during 
the segmentation process (Lempitsky and Boykov 2007; Lempitsky, Blake, and Rother 2008). 

While optimizing the binary MRF energy (4.50) requires the use of combinatorial op- 
timization techniques, such as maximum flow, an approximate solution can be obtained by 
converting the binary energy terms into quadratic energy terms defined over a continuous 
[0, 1] random field, which then becomes a classical membrane-based regularization problem 
(4.24—4.27). The resulting quadratic energy function can then be solved using standard linear 
system solvers (4.27–4.28), although if speed is an issue, you should use multigrid or one
of its variants (Appendix A.5). Once the continuous solution has been computed, it can be
thresholded at 0.5 to yield a binary segmentation. 
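
A minimal sketch of this continuous relaxation on a 1-D row of pixels is given below. Seed pixels are clamped to 0 or 1, the remaining values minimize $\sum_i w_i (x_{i+1} - x_i)^2$, and the result is thresholded at 0.5. Plain Gauss-Seidel iteration is used only to keep the example self-contained; the function name, argument names, and sweep count are assumptions, and all weights are assumed to be positive.

```python
def relaxed_segmentation(weights, seeds, sweeps=500):
    """Continuous [0, 1] labeling of a pixel row followed by thresholding.

    weights[i] couples pixels i and i+1; seeds maps a pixel index to 0.0 or 1.0.
    """
    n = len(weights) + 1
    x = [seeds.get(i, 0.5) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i in seeds:
                continue   # seed values stay fixed
            num = den = 0.0
            if i > 0:
                num += weights[i - 1] * x[i - 1]
                den += weights[i - 1]
            if i < n - 1:
                num += weights[i] * x[i + 1]
                den += weights[i]
            # The minimizer is the weighted average of the two neighbors.
            x[i] = num / den
    return x, [1 if v > 0.5 else 0 for v in x]


# A weak link between pixels 2 and 3 models an intensity edge.
soft, hard = relaxed_segmentation([1, 1, 0.01, 1, 1], {0: 1.0, 5: 0.0})
```

Almost all of the drop from 1 to 0 occurs across the weak link, so the thresholded boundary falls on the edge.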

The [0, 1] continuous optimization problem can also be interpreted as computing the prob- 
ability at each pixel that a random walker starting at that pixel ends up at one of the labeled 
seed pixels, which is also equivalent to computing the potential in a resistive grid where the 
resistors are equal to the edge weights (Grady 2006; Sinop and Grady 2007). K-way seg-
mentations can also be computed by iterating through the seed labels, using a binary problem
with one label set to 1 and all the others set to 0 to compute the relative membership proba-
bilities for each pixel. In follow-on work, Grady and Ali (2008) use a precomputation of the
eigenvectors of the linear system to make the solution with a novel set of seeds faster, which 
is related to the Laplacian matting problem presented in Section 10.4.3 (Levin, Acha, and 
Lischinski 2008). Couprie, Grady et al. (2009) relate the random walker to watersheds and 
other segmentation techniques. Singaraju, Grady, and Vidal (2008) add directed-edge con- 
straints in order to support flux, which makes the energy piecewise quadratic and hence not 
solvable as a single linear system. The random walker algorithm can also be used to solve the 
Mumford-Shah segmentation problem (Grady and Alvino 2008) and to compute fast multi- 
grid solutions (Grady 2008). A nice review of these techniques is given by Singaraju, Grady 
et al. (2011). 

An even faster way to compute a continuous [0, 1] approximate segmentation is to com-
pute weighted geodesic distances between the 0 and 1 seed regions (Bai and Sapiro 2009),
which can also be used to estimate soft alpha mattes (Section 10.4.3). A related approach by 
Criminisi, Sharp, and Blake (2008) can be used to find fast approximate solutions to general 
binary Markov random field optimization problems. 


4.4 Additional reading 


Scattered data interpolation and approximation techniques are fundamental to many different 
branches of applied mathematics. Some good introductory texts and articles include Amidror 
(2002), Wendland (2004), and Anjyo, Lewis, and Pighin (2014). These techniques are also 
related to geometric modeling techniques in computer graphics, which continues to be a very 
active research area. A nice introduction to basic spline techniques for curves and surfaces 
can be found in Farin (2002), while more recent approaches using subdivision surfaces are 
covered in Peters and Reif (2008). 

Data interpolation and approximation also lie at the heart of regression techniques, which 
form the mathematical basis for most of the machine learning techniques we study in the next 


chapter. You can find good introductions to this topic (as well as underfitting, overfitting,
and model selection) in texts on classic machine learning (Bishop 2006; Hastie, Tibshirani,
and Friedman 2009; Murphy 2012; Deisenroth, Faisal, and Ong 2020) and deep learning 
(Goodfellow, Bengio, and Courville 2016; Glassner 2018; Zhang, Lipton et al. 2021). 

Robust data fitting is also central to most computer vision problems. While introduced 
in this chapter, it is also revisited in Appendix B.3. Classic textbooks and articles on ro-
bust fitting and statistics include Huber (1981), Hampel, Ronchetti et al. (1986), Black and
Rangarajan (1996), Rousseeuw and Leroy (1987), and Stewart (1999). The recent paper by 
Barron (2019) unifies many of the commonly used robust potential functions and shows how 
they can be used in machine learning applications. 

The regularization approach to computer vision problems was first introduced to the vi- 
sion community by Poggio, Torre, and Koch (1985) and Terzopoulos (1986a,b, 1988) and 
continues to be a popular framework for formulating and solving low-level vision problems 
(Ju, Black, and Jepson 1996; Nielsen, Florack, and Deriche 1997; Nordström 1990; Brox,
Bruhn et al. 2004; Levin, Lischinski, and Weiss 2008). More detailed mathematical treatment 
and additional applications can be found in the applied mathematics and statistics literature 
(Tikhonov and Arsenin 1977; Engl, Hanke, and Neubauer 1996). 

Variational formulations have been extensively used in low-level computer vision tasks, 
including optical flow (Horn and Schunck 1981; Nagel and Enkelmann 1986; Black and 
Anandan 1993; Alvarez, Weickert, and Sánchez 2000; Brox, Bruhn et al. 2004; Zach, Pock, 
and Bischof 2007a; Wedel, Cremers et al. 2009; Werlberger, Pock, and Bischof 2010), seg- 
mentation (Kass, Witkin, and Terzopoulos 1988; Mumford and Shah 1989; Caselles, Kimmel, 
and Sapiro 1997; Paragios and Deriche 2000; Chan and Vese 2001; Osher and Paragios 2003; 
Cremers 2007), denoising (Rudin, Osher, and Fatemi 1992), stereo (Pock, Schoenemann et al. 
2008), multi-view stereo (Faugeras and Keriven 1998; Yezzi and Soatto 2003; Pons, Keriven, 
and Faugeras 2007; Labatut, Pons, and Keriven 2007; Kolev, Klodt et al. 2009), and scene 
flow (Wedel, Brox et al. 2011). 

The literature on Markov random fields is truly immense, with publications in related 
fields such as optimization and control theory of which few vision practitioners are even 
aware. A good guide to the latest techniques is the book edited by Blake, Kohli, and Rother 
(2011). Other articles that contain nice literature reviews or experimental comparisons in-
clude Boykov and Funka-Lea (2006), Szeliski, Zabih et al. (2008), Kumar, Veksler, and Torr
(2011), and Kappes, Andres et al. (2015). MRFs are just one version of the more general
topic of graphical models, which is covered in several textbooks and surveys, including Bishop
(2006, Chapter 8), Koller and Friedman (2009), Nowozin and Lampert (2011), and Murphy
(2012, Chapters 10, 17, 19).


The seminal paper on Markov random fields is the work of Geman and Geman (1984),
who introduced this formalism to computer vision researchers and also introduced the no-
tion of line processes, additional binary variables that control whether smoothness penalties
are enforced or not. Black and Rangarajan (1996) showed how independent line processes 
could be replaced with robust pairwise potentials; Boykov, Veksler, and Zabih (2001) de- 
veloped iterative binary graph cut algorithms for optimizing multi-label MRFs; Kolmogorov 
and Zabih (2004) characterized the class of binary energy potentials required for these tech- 
niques to work; and Freeman, Pasztor, and Carmichael (2000) popularized the use of loopy 
belief propagation for MRF inference. Many more additional references can be found in 
Sections 4.3 and 4.3.2, and Appendix B.5. 

Continuous-energy-based (variational) approaches to interactive segmentation include Leclerc 
(1989), Mumford and Shah (1989), Chan and Vese (2001), Zhu and Yuille (1996), and Tabb 
and Ahuja (1997). Discrete variants of such problems are usually optimized using binary 
graph cuts or other combinatorial energy minimization methods (Boykov and Jolly 2001; 
Boykov and Kolmogorov 2003; Rother, Kolmogorov, and Blake 2004; Kolmogorov and 
Boykov 2005; Cui, Yang et al. 2008; Vicente, Kolmogorov, and Rother 2008; Lempitsky 
and Boykov 2007; Lempitsky, Blake, and Rother 2008), although continuous optimization 
techniques followed by thresholding can also be used (Grady 2006; Grady and Ali 2008; 
Singaraju, Grady, and Vidal 2008; Criminisi, Sharp, and Blake 2008; Grady 2008; Bai and 
Sapiro 2009; Couprie, Grady et al. 2009). Boykov and Funka-Lea (2006) present a good
survey of various energy-based techniques for binary object segmentation.


4.5 Exercises 


Ex 4.1: Data fitting (scattered data interpolation). Generate some random samples from
a smoothly varying function and then implement and evaluate one or more data interpolation
techniques.

1. Generate a “random” 1-D or 2-D function by adding together a small number of sinusoids
or Gaussians of random amplitudes and frequencies or scales.

2. Sample this function at a few dozen random locations.

3. Fit a function to these data points using one or more of the scattered data interpolation
techniques described in Section 4.1.

4. Measure the fitting error between the estimated and original functions at some set of
locations, e.g., on a regular grid or at different random points.

5. Manually adjust any parameters your fitting algorithm may have to minimize the output
sample fitting error, or use an automated technique such as cross-validation.

6. Repeat this exercise with a new set of random input sample and output sample locations.
Does the optimal parameter change, and if so, by how much?

7. (Optional) Generate a piecewise-smooth test function by using different random parameters
in different parts of your image. How much more difficult does the data
fitting problem become? Can you think of ways you might mitigate this?


Try to implement your algorithm in NumPy (or Matlab) using only array operations, in or- 
der to become more familiar with data-parallel programming and the linear algebra operators 
built into these systems. Use data visualization techniques such as those in Figures 4.3-4.6 


to debug your algorithms and illustrate your results. 
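
As a possible starting point for steps 1 to 4 of Ex 4.1, here is a minimal NumPy sketch rather than a reference solution. It assumes a 1-D test function built from a few random sinusoids and uses a regularized Gaussian radial basis function fit as one example of the Section 4.1 techniques. The names `make_random_function` and `fit_rbf`, the kernel width `sigma`, and the regularization weight `lam` are illustrative choices, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_function(num_terms=4):
    """Step 1: a sum of sinusoids with random amplitude, frequency, and phase."""
    amp = rng.uniform(0.5, 1.5, num_terms)
    freq = rng.uniform(0.5, 3.0, num_terms)
    phase = rng.uniform(0.0, 2 * np.pi, num_terms)
    return lambda x: np.sum(amp * np.sin(freq * x[:, None] + phase), axis=1)

def fit_rbf(x_train, y_train, sigma=0.5, lam=1e-3):
    """Step 3: regularized Gaussian RBF fit; returns a predictor function.
    sigma and lam are hypothetical defaults that would normally be tuned."""
    def kernel(a, b):
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / sigma) ** 2)
    K = kernel(x_train, x_train)
    weights = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
    return lambda x: kernel(x, x_train) @ weights

f = make_random_function()
x_train = rng.uniform(0.0, 10.0, 40)      # step 2: a few dozen random samples
y_train = f(x_train)
predict = fit_rbf(x_train, y_train)
x_grid = np.linspace(0.0, 10.0, 200)      # step 4: evaluate on a regular grid
rmse = np.sqrt(np.mean((predict(x_grid) - f(x_grid)) ** 2))
train_rmse = np.sqrt(np.mean((predict(x_train) - y_train) ** 2))
print(f"grid RMSE = {rmse:.4f}, training RMSE = {train_rmse:.4f}")
```

Everything here uses array operations only, in the spirit of the exercise. Steps 5 to 7 would wrap `fit_rbf` in a loop over `sigma` and `lam` and compare the resulting grid errors.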


Ex 4.2: Graphical model optimization. Download and test out the software on the OpenGM2 
library and benchmarks web site http://hciweb2.iwr.uni-heidelberg.de/opengm (Kappes, An- 
dres et al. 2015). Try applying these algorithms to your own problems of interest (segmenta- 
tion, de-noising, etc.). Which algorithms are more suitable for which problems? How does 


the quality compare to deep learning based approaches, which we study in the next chapter? 


Ex 4.3: Image deblocking—challenging. Now that you have some good techniques to dis- 
tinguish signal from noise, develop a technique to remove the blocking artifacts that occur 
with JPEG at high compression settings (Section 2.3.3). Your technique can be as simple 
as looking for unexpected edges along block boundaries, or looking at the quantization step 
as a projection of a convex region of the transform coefficient space onto the corresponding 


quantized values. 


1. Does the knowledge of the compression factor, which is available in the JPEG header 
information, help you perform better deblocking? See Ehrlich, Lim et al. (2020) for a 


recent paper on this topic. 


2. Because the quantization occurs in the DCT transformed YCbCr space (2.116), it may 
be preferable to perform the analysis in this space. On the other hand, image priors 
make more sense in an RGB space (or do they?). Decide how you will approach this 


dichotomy and discuss your choice. 


3. While you are at it, since the YCbCr conversion is followed by a chrominance subsam- 
pling stage (before the DCT), see if you can restore some of the lost high-frequency 
chrominance signal using one of the better restoration techniques discussed in this 


chapter. 


234 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


4. If your camera has a RAW + JPEG mode, how close can you come to the noise-free 
true pixel values? (This suggestion may not be that useful, since cameras generally use 
reasonably high quality settings for their RAW + JPEG models.) 


Ex 4.4: Inference in deblurring—challenging. Write down the graphical model corresponding
to Figure 4.15 for a non-blind image deblurring problem, i.e., one where the blur kernel
is known ahead of time.

What kind of efficient inference (optimization) algorithms can you think of for solving
such problems?

Figure 5.1 Machine learning and deep neural networks: (a) nearest neighbor classification
© Glassner (2018); (b) Gaussian kernel support vector machine (Bishop 2006) © 2006
Springer; (c) a simple three-layer network © Glassner (2018); (d) the SuperVision deep
neural network, courtesy of Matt Deitke after (Krizhevsky, Sutskever, and Hinton 2012); (e)
network accuracy vs. size and operation counts (Canziani, Culurciello, and Paszke 2017) ©
2017 IEEE; (f) visualizing network features (Zeiler and Fergus 2014) © 2014 Springer.

Machine learning techniques have always played an important and often central role in
the development of computer vision algorithms. Computer vision in the 1970s grew out of the
fields of artificial intelligence, digital image processing, and pattern recognition (now called
machine learning), and one of the premier journals in our field (IEEE Transactions on Pattern
Analysis and Machine Intelligence) still bears testament to this heritage.


The image processing, scattered data interpolation, variational energy minimization, and 
graphical model techniques introduced in the previous two chapters have been essential tools 
in computer vision over the last five decades. While elements of machine learning and pat- 
tern recognition have also been widely used, e.g., for fine-tuning algorithm parameters, they 
really came into their own with the availability of large-scale labeled image datasets, such 
as ImageNet (Deng, Dong et al. 2009; Russakovsky, Deng et al. 2015), COCO (Lin, Maire 
et al. 2014), and LVIS (Gupta, Dollár, and Girshick 2019). Currently, deep neural networks 
are the most popular and widely used machine learning models in computer vision, not just 
for semantic classification and segmentation, but even for lower-level tasks such as image 
enhancement, motion estimation, and depth recovery (Bengio, LeCun, and Hinton 2021). 

Figure 5.2 shows the main distinctions between traditional computer vision techniques, 
in which all of the processing stages were designed by hand, machine learning algorithms, in 
which hand-crafted features were passed on to a machine learning stage, and deep networks, 
in which all of the algorithm components, including mid-level representations, are learned 
directly from the training data. 

We begin this chapter with an overview of classical machine learning approaches, such
as nearest neighbors, logistic regression, support vector machines, and decision forests. This
is a broad and deep subject, and we only provide a brief summary of the main popular approaches.
More details on these techniques can be found in textbooks on this subject, which
include Bishop (2006), Hastie, Tibshirani, and Friedman (2009), Murphy (2012), Criminisi
and Shotton (2013), and Deisenroth, Faisal, and Ong (2020).

The machine learning part of the chapter focuses mostly on supervised learning for classification
tasks, in which we are given a collection of inputs $\{\mathbf{x}_i\}$, which may be features
derived from input images, paired with their corresponding class labels (or targets) $\{t_i\}$,
which come from a set of classes $\{C_k\}$. Most of the techniques described for supervised classification
can easily be extended to regression, i.e., associating inputs $\{\mathbf{x}_i\}$ with real-valued
scalar or vector outputs $\{y_i\}$, which we have already studied in Section 4.1. We also look at
some examples of unsupervised learning (Section 5.2), where there are no labels or outputs,
as well as semi-supervised learning, in which labels or targets are only provided for a subset
of the samples.

[Figure 5.2 diagram labels: hand-crafted features, hand-crafted algorithm; hand-crafted features; (c) Deep learning pipeline]


Figure 5.2 Traditional, machine learning, and deep learning pipelines, inspired by Good- 
fellow, Bengio, and Courville (2016, Figure 1.5). In a classic vision pipeline such as struc- 
ture from motion, both the features and the algorithm were traditionally designed by hand 
(although learning techniques could be used, e.g., to design more repeatable features). Clas- 
sic machine learning approaches take extracted features and use machine learning to build a 
classifier. Deep learning pipelines learn the whole pipeline, starting from pixels all the way 
to outputs, using end-to-end training (indicated by the backward dashed arrows) to fine-tune 


the model parameters. 


The second half of this chapter focuses on deep neural networks, which, over the last
decade, have become the method of choice for most computer vision recognition and lower-level
vision tasks. We begin with the elements that make up deep neural networks, including
weights and activations, regularization terms, and training using backpropagation and
stochastic gradient descent. Next, we introduce convolutional layers, review some of the
classic architectures, and talk about how to pre-train networks and visualize their performance.
Finally, we briefly touch on more advanced networks, such as three-dimensional and
spatio-temporal models, as well as recurrent and generative adversarial networks.


Because machine learning and deep learning are such rich and deep topics, this chapter
just briefly summarizes some of the main concepts and techniques. Comprehensive texts on
classic machine learning include Bishop (2006), Hastie, Tibshirani, and Friedman (2009),
Murphy (2012), and Deisenroth, Faisal, and Ong (2020), while textbooks focusing on deep
learning include Goodfellow, Bengio, and Courville (2016), Glassner (2018), Glassner (2021),
and Zhang, Lipton et al. (2021).

[Figure 5.3 diagram labels: supervised learning, training inputs, training labels]

Figure 5.3 In supervised learning, paired training inputs and labels are used to estimate
the model parameters that best predict the labels from their corresponding inputs. At run
time, the model parameters are (usually) frozen, and the model is applied to new inputs to
generate the desired outputs. © Zhang, Lipton et al. (2021, Figure 1.3)


5.1 Supervised learning 


Machine learning algorithms are usually categorized as either supervised, where paired inputs 
and outputs are given to the learning algorithm (Figure 5.3), or unsupervised, where statistical 
samples are provided without any corresponding labeled outputs (Section 5.2). 

As shown in Figure 5.3, supervised learning involves feeding pairs of inputs $\{\mathbf{x}_i\}$ and
their corresponding target output values $\{t_i\}$ into a learning algorithm, which adjusts the
model’s parameters so as to maximize the agreement between the model’s predictions and
the target outputs. The outputs can either be discrete labels that come from a set of classes
$\{C_k\}$, or they can be a set of continuous, potentially vector-valued values, which we denote by
$y_i$ to make the distinction between the two cases clearer. The first task is called classification,
since we are trying to predict class membership, while the second is called regression, since
historically, fitting a trend to data was called by that name (Section 4.1).¹

After a training phase during which all of the training data (labeled input-output pairs) 
have been processed (often by iterating over them many times), the trained model can now be 
used to predict new output values for previously unseen inputs. This phase is often called the 
test phase, although this sometimes fools people into focusing excessively on performance 
on a given test set, rather than building a system that works robustly for any plausible inputs 


that might arise.

¹Note that in software engineering, a regression sometimes means a change in the code that results in degraded
performance. That is not the kind of regression we will be studying here.

In this section, we focus more on classification, since we’ve already covered some of the
simpler (linear and kernel) methods for regression in the previous chapter. One of the most 
common applications of classification in computer vision is semantic image classification, 
where we wish to label a complete image (or predetermined portion) with its most likely 
semantic category, e.g., horse, cat, or car (Section 6.2). This is the main application for 
which deep networks (Sections 5.3-5.4) were originally developed. More recently, however, 
such networks have also been applied to continuous pixel labeling tasks such as semantic 
segmentation, image denoising, and depth and motion estimation. More sophisticated tasks, 
such as object detection and instance segmentation, will be covered in Chapter 6. 

Before we begin our review of traditional supervised learning techniques, we should de- 
fine a little more formally what the system is trying to learn, i.e., what we meant by “maximize 
the agreement between the model’s predictions and the target outputs.” Ultimately, like any 
other computer algorithm that will occasionally make mistakes under uncertain, noisy, and/or 
incomplete data, we would like to maximize its expected utility, or conversely, minimize its 
expected loss or risk. This is the subject of decision theory, which is explained in more 
detail in textbooks on machine learning (Bishop 2006, Section 1.5; Hastie, Tibshirani, and 
Friedman 2009, Section 2.4; Murphy 2012, Section 6.5; Deisenroth, Faisal, and Ong 2020, 
Section 8.2). 

We usually do not have access to the true probability distribution over the inputs, let alone 
the joint distribution over inputs and corresponding outputs. For this reason, we often use the 
training data distribution as a proxy for the real-world distribution. This approximation is 
known as empirical risk minimization (see above citations on decision theory), where the 
expected risk can be estimated with 


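$$E_{\text{Risk}}(\mathbf{w}) = \frac{1}{N} \sum_i L(y_i, f(\mathbf{x}_i; \mathbf{w})). \tag{5.1}$$

The loss function $L$ measures the “cost” of predicting an output $f(\mathbf{x}_i; \mathbf{w})$ for input $\mathbf{x}_i$ and
model parameters $\mathbf{w}$ when the corresponding target is $y_i$.²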

This formula should by now be quite familiar, since it is the same one we introduced in
the previous chapter (4.2; 4.15) for regression. In those cases, the cost (penalty) is a simple
quadratic or robust function of the difference between the target output $y_i$ and the output
predicted by the model $f(\mathbf{x}_i; \mathbf{w})$. In some situations, we may want the loss to model specific
asymmetries in misprediction. For example, in autonomous navigation, it is usually more
costly to over-estimate the distance to the nearest obstacle, potentially resulting in a collision,
than to more conservatively under-estimate. We will see more examples of loss functions
later on in this chapter, including Section 5.1.3 on logistic regression (5.19-5.24) and
Section 5.3.4 on neural network loss (5.54-5.56).

²In the machine learning literature, it is more common to write the loss using the letter L. But since we have used
the letter E for energy (or summed error) in the previous chapter, we will stick to that notation throughout the book.

In classification tasks, it is common to minimize the misclassification rate, i.e., penalizing
all class prediction errors equally using a class-agnostic delta function (Bishop 2006,
Sections 1.5.1-1.5.2). However, asymmetries often exist. For example, the cost of produc- 
ing a false negative diagnosis in medicine, which may result in an untreated illness, is often 
greater than that of a false positive, which may suggest further tests. We will discuss true and 
false positives and negatives, along with error rates, in more detail in Section 7.1.3. 
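
To make the empirical risk (5.1) and the idea of an asymmetric loss more concrete, here is a small NumPy sketch. The function names and the penalty ratio `over_weight=4.0` are hypothetical. They only illustrate how over-estimates (e.g., of the distance to an obstacle) can be charged more heavily than under-estimates of the same size.

```python
import numpy as np

def empirical_risk(loss, targets, predictions):
    """Average per-sample loss over the training set, as in (5.1)."""
    return np.mean(loss(targets, predictions))

def squared_loss(y, f):
    """Symmetric quadratic penalty on the prediction error."""
    return (f - y) ** 2

def asymmetric_squared_loss(y, f, over_weight=4.0):
    """Quadratic loss that penalizes over-estimates (f > y) more heavily.
    The weight 4.0 is an arbitrary illustrative value."""
    err = f - y
    return np.where(err > 0, over_weight, 1.0) * err ** 2

y = np.array([2.0, 5.0, 3.0])    # true distances
f_over = y + 0.5                 # predictions that over-estimate
f_under = y - 0.5                # predictions that under-estimate

print(empirical_risk(squared_loss, y, f_over))              # 0.25
print(empirical_risk(squared_loss, y, f_under))             # 0.25
print(empirical_risk(asymmetric_squared_loss, y, f_over))   # 1.0
print(empirical_risk(asymmetric_squared_loss, y, f_under))  # 0.25
```

The symmetric loss cannot tell the two kinds of error apart, while the asymmetric one assigns four times the risk to the over-estimating predictor.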


Data preprocessing 


Before we start our review of widely used machine learning techniques, we should mention 
that it is usually a good idea to center, standardize, and if possible, whiten the input data 
(Glassner 2018, Section 10.5; Bishop 2006, Section 12.1.3). Centering the feature vectors 
means subtracting their mean value, while standardizing means also re-scaling each compo- 
nent so that its variance (average squared distance from the mean) is 1. 

Whitening is a more computationally expensive process, which involves computing the 
covariance matrix of the inputs, taking its SVD, and then rotating the coordinate system so 
that the final dimensions are uncorrelated and have unit variance (under a Gaussian model). 
While this may be quite practical and helpful for low-dimension inputs, it can become pro- 
hibitively expensive for large sets of images. (But see the discussion in Section 5.2.3 on 
principal component analysis, where it can be feasible and useful.) 
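
The three preprocessing steps can be sketched in a few lines of NumPy. This is an illustrative sketch under simple assumptions (samples stored as rows, a full-rank covariance, a small `eps` added for numerical safety), and the helper names `center`, `standardize`, and `whiten` are introduced here only for the example.

```python
import numpy as np

def center(X):
    """Subtract the per-feature mean (rows of X are samples)."""
    return X - X.mean(axis=0)

def standardize(X, eps=1e-12):
    """Center, then rescale each feature to unit variance."""
    Xc = center(X)
    return Xc / (Xc.std(axis=0) + eps)

def whiten(X, eps=1e-12):
    """Center, rotate onto the covariance eigenvectors, scale to unit variance."""
    Xc = center(X)
    cov = Xc.T @ Xc / len(Xc)
    U, S, _ = np.linalg.svd(cov)     # cov = U diag(S) U^T for a symmetric PSD matrix
    return (Xc @ U) / np.sqrt(S + eps)

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0, 0.0], [1.5, 1.0, 0.0], [0.3, -0.7, 0.5]])
X = rng.normal(size=(1000, 3)) @ A.T + np.array([10.0, -3.0, 4.0])  # correlated features

Xw = whiten(X)
cov_w = Xw.T @ Xw / len(Xw)
print(np.round(cov_w, 3))            # close to the identity matrix
```

Standardizing fixes only the diagonal of the covariance, whereas whitening also removes the off-diagonal correlations. The cost of forming and decomposing the covariance matrix is what becomes prohibitive for high-dimensional inputs such as full images.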

With this background in place, we now turn our attention to some widely used supervised 
learning techniques, namely nearest neighbors, Bayesian classification, logistic regression, 


support vector machines, and decision trees and forests. 


5.1.1 Nearest neighbors 


Nearest neighbors is a very simple non-parametric technique, i.e., one that does not involve 
a low-parameter analytic form for the underlying distribution. Instead, the training examples 
are all retained, and at evaluation time the “nearest” k neighbors are found and then averaged 
to produce the output.³

Figure 5.4 shows a simple graphical example for various values of k, i.e., from using the
k = 1 nearest neighbor all the way to finding the k = 25 nearest neighbors and selecting
the class with the highest count as the output label. As you can see, changing the number of
neighbors affects the final class label, which changes from red to blue.

³The reason I put “nearest” in quotations is that standardizing and/or whitening the data will affect distances
between vectors, and is usually helpful.

Figure 5.4 Nearest neighbor classification. To determine the class of the star (X) test
sample, we find the k nearest neighbors and select the most popular class. This figure shows
the results for k = 1, 9, and 25 samples. © Glassner (2018)

Figure 5.5 For noisy (intermingled) data, selecting too small a value of k results in irregular
decision surfaces. Selecting too large a value can cause small regions to shrink or
disappear. © Glassner (2018)

Figure 5.5 shows the effect of varying the number of neighbors in another way. The left
half of the figure shows the initial samples, which fall into either blue or orange categories.
As you can see, the training samples are highly intermingled, i.e., there is no clear (plausible)
boundary that will correctly label all of the samples. The right side of this figure shows the
decision boundaries for a k-NN classifier as we vary the values of k from 1 to 50. When k
is too small, the classifier acts in a very random way, i.e., it is overfitting to the training data
(Section 4.1.2). As k gets larger, the classifier underfits (over-smooths) the data, resulting in
the shrinkage of the two smaller regions. The number of nearest neighbors to use,
k, is a hyperparameter for this algorithm. Techniques for determining a good value include
cross-validation, which we discussed in Section 4.1.2.
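
A brute-force k-NN classifier takes only a few lines of NumPy. The sketch below is illustrative: it assumes Euclidean distance on already standardized features, integer class labels starting at 0, and a synthetic two-class dataset. The name `knn_classify` is hypothetical.

```python
import numpy as np

def knn_classify(X_train, t_train, X_test, k=5):
    """Label each test point with the most common class among its k nearest
    training samples (Euclidean distance, exhaustive search)."""
    # Pairwise squared distances, shape (num_test, num_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest samples
    votes = t_train[nearest]                      # their class labels
    num_classes = t_train.max() + 1
    counts = np.stack([(votes == c).sum(axis=1) for c in range(num_classes)], axis=1)
    return counts.argmax(axis=1)                  # class with the highest count

rng = np.random.default_rng(1)
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(100, 2))   # class 0 samples
X1 = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(100, 2))   # class 1 samples
X_train = np.vstack([X0, X1])
t_train = np.array([0] * 100 + [1] * 100)

X_test = np.array([[-2.0, 0.0], [2.0, 0.0]])      # one query at each class center
for k in (1, 9, 25):
    print(k, knn_classify(X_train, t_train, X_test, k))
```

The full distance matrix costs time and memory proportional to the product of the test and training set sizes, which is why approximate search structures become necessary for large collections. To choose k by cross-validation, one would call `knn_classify` on held-out folds for several values of k and keep the value with the lowest validation error.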


While nearest neighbors is a rather brute-force machine learning technique (although
Cover and Hart (1967) showed that it is statistically optimal in the large sample limit),
it can still be useful in many computer vision applications, such as large-scale matching and
indexing (Section 7.1.4). As the number of samples gets large, however, efficient techniques
must be used to find the (exact or approximate) nearest neighbors. Good algorithms for finding
nearest neighbors have been developed in both the general computer science and more
specialized computer vision communities.


Muja and Lowe (2014) developed a Fast Library for Approximate Nearest Neighbors 
(FLANN), which collects a number of previously developed algorithms and is incorporated 
as part of OpenCV. The library implements several powerful approximate nearest neighbor 
algorithms, including randomized k-d trees (Silpa-Anan and Hartley 2008), priority search 
k-means trees, approximate nearest neighbors (Friedman, Bentley, and Finkel 1977), and 
locality sensitive hashing (LSH) (Andoni and Indyk 2006). Their library can empirically 
determine which algorithm and parameters to use based on the characteristics of the data 


being indexed. 


More recently, Johnson, Douze, and Jégou (2021) developed the GPU-enabled Faiss library⁴
for scaling similarity search (Section 6.2.3) to billions of vectors. The library is based
on product quantization (Jégou, Douze, and Schmid 2010), which had been shown by the
authors to perform better than LSH (Gordo, Perronnin et al. 2013) on the kinds of large-scale
datasets the Faiss library was developed for.


5.1.2 Bayesian classification 


For some simple machine learning problems, e.g., if we have an analytic model of feature
construction and noising, or if we can gather enough samples, we can determine the probability
distributions of the feature vectors for each class $p(\mathbf{x}|C_k)$ as well as the prior class
likelihoods $p(C_k)$.⁵ According to Bayes’ rule (4.33), the likelihood of class $C_k$ given a feature
vector $\mathbf{x}$ (Figure 5.6) is given by

\begin{align}
p(C_k|\mathbf{x}) &= \frac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} \tag{5.2} \\
&= \frac{\exp l_k}{\sum_j \exp l_j}, \tag{5.3}
\end{align}

⁴https://github.com/facebookresearch/faiss
⁵The following notation and equations are adapted from Bishop (2006, Section 4.2), which describes probabilistic
generative classification.

[Figure 5.6 plot label: class densities]

Figure 5.6 An example with two class conditional densities $p(\mathbf{x}|C_k)$ along with the corresponding
posterior class probabilities $p(C_k|\mathbf{x})$, which can be obtained using Bayes’ rule, i.e.,
by dividing by the sum of the two curves (Bishop 2006) © 2006 Springer. The vertical green
line is the optimal decision boundary for minimizing the misclassification rate.


where the second form (using the exp functions) is known as the normalized exponential or
softmax function.⁶ The quantity

$$l_k = \log p(\mathbf{x}|C_k) + \log p(C_k) \tag{5.4}$$

is the log-likelihood of sample $\mathbf{x}$ being from class $C_k$.⁷ It is sometimes convenient to denote
the softmax function (5.3) as a vector-to-vector valued function,

$$\mathbf{p} = \text{softmax}(\mathbf{l}). \tag{5.5}$$

The softmax function can be viewed as a soft version of a maximum indicator function,
which returns 1 for the largest value of $l_k$ whenever it dominates the other values. It is widely
used in machine learning and statistics, including its frequent use as the final non-linearity in
deep neural classification networks (Figure 5.27).
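
A short NumPy sketch of the softmax function (5.3) follows. It uses the standard stabilization trick of subtracting the largest input before exponentiating, which leaves the result unchanged because the common factor cancels between numerator and denominator. The example input values are arbitrary.

```python
import numpy as np

def softmax(l):
    """Normalized exponential (5.3) of a vector of log-likelihoods l."""
    shifted = l - np.max(l)      # largest entry becomes 0, so every exp() lies in (0, 1]
    e = np.exp(shifted)
    return e / e.sum()

l = np.array([1000.0, 1001.0, 1002.0])   # a naive np.exp(l) would overflow to inf
p = softmax(l)
print(p, p.sum())
```

Because only differences between the inputs matter, `softmax(l)` here equals `softmax(np.array([0.0, 1.0, 2.0]))`. As the gaps between the inputs grow, the output approaches a one-hot indicator of the largest entry.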

The process of using formula (5.2) to determine the likelihood of a class $C_k$ given a
feature vector $\mathbf{x}$ is known as Bayesian classification, since it combines a conditional feature
likelihood $p(\mathbf{x}|C_k)$ with a prior distribution over classes $p(C_k)$ using Bayes’ rule to determine
the posterior class probabilities. In the case where the components of the feature vector are
generated independently, i.e.,

$$p(\mathbf{x}|C_k) = \prod_i p(x_i|C_k), \tag{5.6}$$

the resulting technique is called a naive Bayes classifier.

⁶For better numerical stability, it is common to subtract the largest value of $l_j$ from all of the input values so that
the exponentials are in the range (0, 1] and there is less chance of roundoff error.

⁷Some authors (e.g., Zhang, Lipton et al. 2021) use the term logit for the log-likelihood, although it is more
commonly used to denote the log odds, discussed below, or the softmax function itself.

Figure 5.7 The logistic sigmoid function $\sigma(l)$, shown in red, along with a scaled error
function, shown in dashed blue (Bishop 2006) © 2006 Springer.


For the binary (two class) classification task, we can re-write (5.3) as

$$p(C_0|\mathbf{x}) = \frac{\exp l_0}{\exp l_0 + \exp l_1} = \frac{1}{1 + \exp(-l)} = \sigma(l), \tag{5.7}$$

where $l = l_0 - l_1$ is the difference between the two class log likelihoods and is known as the
log odds or logit.

The $\sigma(l)$ function is called the logistic sigmoid function (or simply the logistic function
or logistic curve), where sigmoid means an S-shaped curve (Figure 5.7). The sigmoid was a
popular activation function in earlier neural networks, although it has now been replaced by
other functions, as discussed in Section 5.3.2.
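
The step from the two-class softmax to the sigmoid in (5.7) follows by dividing the numerator and denominator by $\exp l_0$. Written out as a short worked derivation, using only the definitions above:

```latex
\begin{align*}
p(C_0|\mathbf{x})
  &= \frac{\exp l_0}{\exp l_0 + \exp l_1}
   = \frac{1}{1 + \exp(l_1 - l_0)}
   = \frac{1}{1 + \exp(-l)} = \sigma(l), \qquad l = l_0 - l_1, \\
p(C_1|\mathbf{x})
  &= 1 - \sigma(l) = \frac{\exp(-l)}{1 + \exp(-l)} = \sigma(-l).
\end{align*}
```

The second line shows the symmetry of the sigmoid: swapping the roles of the two classes simply flips the sign of the log odds.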


Linear and quadratic discriminant analysis 


While probabilistic generative classification based on the normalized exponential and sigmoid 
can be applied to any set of log likelihoods, the formulas become much simpler when the 
distributions are multi-dimensional Gaussians. 


For Gaussians with identical covariance matrices $\Sigma$, we have

$$p(\mathbf{x}|C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}. \tag{5.8}$$

Figure 5.8 Logistic regression for two identically distributed Gaussian classes (Bishop
2006) © 2006 Springer: (a) two Gaussian distributions shown in red and blue; (b) the posterior
probability $p(C_0|\mathbf{x})$, shown as both the height of the function and the proportion of red
ink.


In the case of two classes (binary classification), we obtain (Bishop 2006, Section 4.2.1)

$$p(C_0|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b), \tag{5.9}$$

with

$$\mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1), \quad \text{and} \tag{5.10}$$

$$b = -\frac{1}{2} \boldsymbol{\mu}_0^T \Sigma^{-1} \boldsymbol{\mu}_0 + \frac{1}{2} \boldsymbol{\mu}_1^T \Sigma^{-1} \boldsymbol{\mu}_1 + \log \frac{p(C_0)}{p(C_1)}. \tag{5.11}$$

Equation (5.9), which we will revisit shortly in the context of non-generative (discriminative)
classification (5.18), is called logistic regression, since we pass the output of a linear
regression formula

$$l(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b \tag{5.12}$$

through the logistic function to obtain a class probability. Figure 5.8 illustrates this in two
dimensions, where the posterior likelihood of the red class $p(C_0|\mathbf{x})$ is shown on the right side.

In linear regression (5.12), $\mathbf{w}$ plays the role of the weight vector along which we project
the feature vector $\mathbf{x}$, and $b$ plays the role of the bias, which determines where to set the
classification boundary. Note that the weight direction (5.10) aligns with the vector joining
the distribution means (after rotating the coordinates by the inverse covariance $\Sigma^{-1}$),
while the bias term is proportional to the mean squared moments and the log class prior ratio
$\log(p(C_0)/p(C_1))$.

Figure 5.9 Quadratic discriminant analysis (Bishop 2006) © 2006 Springer. When the
class covariances $\Sigma_k$ are different, the decision surfaces between Gaussian distributions
become quadratic surfaces.


For $K > 2$ classes, the softmax function (5.3) can be applied to the linear regression log
likelihoods,

$$l_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + b_k, \tag{5.13}$$

with

$$\mathbf{w}_k = \Sigma^{-1} \boldsymbol{\mu}_k, \quad \text{and} \tag{5.14}$$

$$b_k = -\frac{1}{2} \boldsymbol{\mu}_k^T \Sigma^{-1} \boldsymbol{\mu}_k + \log p(C_k). \tag{5.15}$$

Because the decision boundaries along which the classification switches from one class
to another are linear,

$$\mathbf{w}_k^T \mathbf{x} + b_k > \mathbf{w}_l^T \mathbf{x} + b_l, \tag{5.16}$$

the technique of classifying examples using such criteria is known as linear discriminant
analysis (Bishop 2006, Section 4.1; Murphy 2012, Section 4.2.2).⁸
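
As a sanity check on (5.13) to (5.15), the following NumPy sketch compares the posterior obtained from the linear scores with the one obtained by applying Bayes' rule (5.2) directly to the Gaussian densities (5.8). The class means, shared covariance, priors, and query point are made-up example values, and the function names are hypothetical.

```python
import numpy as np

def lda_parameters(means, cov, priors):
    """Per-class weights (5.14) and biases (5.15) for a shared covariance."""
    cov_inv = np.linalg.inv(cov)
    W = means @ cov_inv                               # row k is (cov^-1 mu_k)^T
    b = -0.5 * np.sum(W * means, axis=1) + np.log(priors)
    return W, b

def softmax(l):
    e = np.exp(l - l.max())
    return e / e.sum()

def gaussian_posterior(x, means, cov, priors):
    """Reference: Bayes' rule (5.2) applied directly to the Gaussian densities (5.8)."""
    cov_inv = np.linalg.inv(cov)
    d = x - means
    # The shared normalizing constant cancels in the softmax, so it is omitted.
    log_density = -0.5 * np.sum(d @ cov_inv * d, axis=1)
    return softmax(log_density + np.log(priors))

means = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])   # example class means
cov = np.array([[1.0, 0.3], [0.3, 0.5]])                  # example shared covariance
priors = np.array([0.5, 0.3, 0.2])
x = np.array([0.7, 1.2])

W, b = lda_parameters(means, cov, priors)
p_linear = softmax(W @ x + b)                             # (5.13) followed by (5.3)
p_direct = gaussian_posterior(x, means, cov, priors)
print(p_linear, p_direct)
```

The two posteriors agree because the quadratic term in $\mathbf{x}$ is identical for every class and cancels in the normalization, which is exactly why the decision boundaries are linear. With class-specific covariances that term no longer cancels, and the boundaries become quadratic.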

Thus far, we have looked at the case where all of the class covariance matrices $\Sigma_k$ are
identical. When they vary between classes, the decision surfaces are no longer linear and they 
become quadratic (Figure 5.9). The derivation of these quadratic decision surfaces is known 
as quadratic discriminant analysis (Murphy 2012, Section 4.2.1). 

In the case where Gaussian class distributions are not available, we can still find the best
discriminant direction using Fisher discriminant analysis (Bishop 2006, Section 4.1.4; Murphy
2012, Section 8.6.3), as shown in Figure 5.10. Such analysis can be useful in separately
modeling variability within different classes, e.g., the appearance variation of different people
(Section 5.2.3).

⁸The acronym LDA is commonly used with linear discriminant analysis, but is sometimes also used for latent
Dirichlet allocation in graphical models.

Figure 5.10 Fisher linear discriminant (Bishop 2006) © 2006 Springer. To find the projection
direction to best separate two classes, we compute the sum of the two class covariances
and then use its inverse to rotate the vector between the two class means.


5.1.3 Logistic regression 


In the previous section, we derived classification rules based on posterior probabilities applied 
to multivariate Gaussian distributions. Quite often, however, Gaussians are not appropriate 
models of our class distributions and we must resort to alternative techniques. 

One of the simplest among these is logistic regression, which applies the same ideas as in
the previous section, i.e., a linear projection onto a weight vector,

$$l_i = \mathbf{w} \cdot \mathbf{x}_i + b \tag{5.17}$$

followed by a logistic function

$$p_i = p(C_0|\mathbf{x}_i) = \sigma(l_i) = \sigma(\mathbf{w}^T \mathbf{x}_i + b) \tag{5.18}$$

to obtain (binary) class probabilities. Logistic regression is a simple example of a discriminative
model, since it does not construct or assume a prior distribution over unknowns, and
hence is not generative, i.e., we cannot generate random samples from the class (Bishop 2006,
Section 1.5.4).

As we no longer have analytic estimates for the class means and covariances (or they are 
poor models of the class distributions), we need some other method to determine the weights 
w and bias b. We do this by maximizing the posterior log likelihoods of the correct labels. 

For the binary classification task, let $t_i \in \{0, 1\}$ be the class label for each training
sample $\mathbf{x}_i$ and $p_i = p(C_0|\mathbf{x}_i)$ be the estimated likelihood predicted by (5.18) for a given
weight and bias $(\mathbf{w}, b)$. We can maximize the likelihood of the correct labels being predicted
by minimizing the negative log likelihood, i.e., the cross-entropy loss or error function,

$$E_{\text{CE}}(\mathbf{w}, b) = -\sum_i \{ t_i \log p_i + (1 - t_i) \log(1 - p_i) \} \tag{5.19}$$


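(Bishop 2006, Section 4.3.2).⁹ Note how, with the loss written this way, a label $t_i = 1$ rewards a high $p_i$
and a label $t_i = 0$ rewards a low $p_i$, so $t_i = 1$ must mark the samples belonging to the class whose probability $p_i$ models.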
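
The following NumPy sketch fits (5.17) to (5.19) by plain gradient descent on synthetic data. It assumes the label convention implied by (5.19) as written, namely that $t_i = 1$ marks the class whose probability $p_i$ is being modeled. The dataset, step size, and iteration count are arbitrary illustrative choices, not recommendations from the text.

```python
import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))

def cross_entropy(w, b, X, t, eps=1e-12):
    """Binary cross-entropy (5.19); p_i is the predicted probability that t_i = 1
    (an assumed convention). eps guards against log(0)."""
    p = sigmoid(X @ w + b)
    return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

def gradient(w, b, X, t):
    """Gradient of (5.19): sum_i (p_i - t_i) x_i for w, and sum_i (p_i - t_i) for b."""
    r = sigmoid(X @ w + b) - t
    return X.T @ r, r.sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
t = np.array([0.0] * 100 + [1.0] * 100)

w, b = np.zeros(2), 0.0
initial_loss = cross_entropy(w, b, X, t)
step = 0.3 / len(X)                       # hypothetical fixed step size
for _ in range(500):                      # first-order updates only
    gw, gb = gradient(w, b, X, t)
    w, b = w - step * gw, b - step * gb
final_loss = cross_entropy(w, b, X, t)
accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (t == 1))
print(initial_loss, final_loss, accuracy)
```

The residual `p_i - t_i` that appears in the gradient is the same quantity a second-order method would reuse when forming its update. This sketch deliberately keeps to first-order steps for brevity.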
This formula can easily be extended to a multi-class loss by again defining the posterior
probabilities as normalized exponentials over per-class linear regressions, as in (5.3) and
(5.13),

$$p_{ik} = p(C_k | \mathbf{x}_i) = \frac{\exp l_{ik}}{\sum_j \exp l_{ij}} = \frac{1}{Z_i} \exp l_{ik}, \tag{5.20}$$

with

$$l_{ik} = \mathbf{w}_k^T \mathbf{x}_i + b_k. \tag{5.21}$$
The term $Z_i = \sum_j \exp l_{ij}$ can be a useful shorthand in derivations and is sometimes
called the partition function. After some manipulation (Bishop 2006, Section 4.3.4), the
corresponding multi-class cross-entropy loss (a.k.a. multinomial logistic regression objective)
becomes

$$E_{\mathrm{MCCE}}(\{\mathbf{w}_k, b_k\}) = -\sum_i \sum_k \tilde{t}_{ik} \log p_{ik}, \tag{5.22}$$

where the 1-of-K (or one-hot) encoding has $\tilde{t}_{ik} = 1$ if sample $i$ belongs to class
$k$ (and 0 otherwise).¹⁰ It is more common to simply use the integer class value $t_i$ as the
target, in which case we can re-write this even more succinctly as

$$E(\{\mathbf{w}_k, b_k\}) = -\sum_i \log p_{i t_i}, \tag{5.23}$$

i.e., we simply sum up the log likelihoods of the correct class for each training sample.
Substituting the softmax formula (5.20) into this loss, we can re-write it as

$$E(\{\mathbf{w}_k, b_k\}) = \sum_i (\log Z_i - l_{i t_i}). \tag{5.24}$$
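The equivalence between (5.23) and (5.24) is easy to check numerically. The sketch below, which assumes made-up logits and uses illustrative function names, computes log Z_i with the usual max-subtraction trick so that very large logits do not overflow.

```python
import math

def log_partition(logits):
    """log Z_i = log sum_j exp(l_ij), computed stably by subtracting the max logit."""
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits))

def softmax(logits):
    """Equation (5.20): p_ik = exp(l_ik) / Z_i."""
    log_z = log_partition(logits)
    return [math.exp(l - log_z) for l in logits]

def multiclass_cross_entropy(all_logits, targets):
    """Equation (5.24): sum over samples of (log Z_i - l_{i, t_i})."""
    return sum(log_partition(l) - l[t] for l, t in zip(all_logits, targets))

# Hypothetical logits for three samples and three classes; the last row would
# overflow a naive exp() and is included to exercise the stable formulation.
all_logits = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0], [1000.0, 999.0, 998.0]]
targets = [0, 2, 0]
probs = [softmax(l) for l in all_logits]
loss_524 = multiclass_cross_entropy(all_logits, targets)
loss_523 = -sum(math.log(p[t]) for p, t in zip(probs, targets))  # equation (5.23)
```

Working with log Z_i directly, as in (5.24), avoids forming probabilities that may round to zero before the logarithm is taken.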


⁹Note, however, that since this derivation is based on the assumption of Gaussian noise, it may not perform well
if there are outliers, e.g., errors in the labels. In such a case, a more robust measure such as mean absolute error
(MAE) may be preferable (Ghosh, Kumar, and Sastry 2017) or it may be necessary to re-weight the training samples
(Ren, Zeng et al. 2018).

¹⁰This kind of representation can be useful if we wish the target classes to be a mixture, e.g., in the mixup data
augmentation technique of Zhang, Cisse et al. (2018).
To determine the best set of weights and biases, $\{\mathbf{w}_k, b_k\}$, we can use
gradient-based optimization, e.g., update their values using a Newton-Raphson second-order
optimization scheme (Bishop 2006, Section 4.3.3),

$$\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1} \nabla E(\mathbf{w}), \tag{5.25}$$

where $\nabla E$ is the gradient of the loss function $E$ with respect to the weight variables
$\mathbf{w}$, and $\mathbf{H}$
is the Hessian matrix of second derivatives of E. Because the cross-entropy functions are not 
linear in the unknown weights, we need to iteratively solve this equation a few times to arrive 
at a good solution. Since the elements in H are updated after each iteration, this technique 
is also known as iteratively reweighted least squares, which we will study in more detail in 
Section 8.1.4. While many non-linear optimization problems have multiple local minima, the 
cross-entropy functions described in this section do not, so we are guaranteed to arrive at a 
unique solution. 
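To make the update (5.25) concrete, here is a small sketch for binary logistic regression with a single scalar feature, so that the unknowns are just (w, b) and the Hessian is a 2 × 2 matrix that can be inverted in closed form. The scalar-feature restriction, the toy data (chosen to overlap, so the optimum is finite), and the fixed iteration count are all assumptions made for brevity.

```python
import math

def sigmoid(l):
    return 1.0 / (1.0 + math.exp(-l))

def gradient_and_hessian(w, b, xs, ts):
    """Gradient and Hessian of the cross-entropy loss (5.19) for a scalar feature x."""
    g_w = g_b = h_ww = h_wb = h_bb = 0.0
    for x, t in zip(xs, ts):
        p = sigmoid(w * x + b)
        r = p * (1.0 - p)      # per-sample weight; it changes at every iteration
        g_w += (p - t) * x
        g_b += (p - t)
        h_ww += r * x * x
        h_wb += r * x
        h_bb += r
    return (g_w, g_b), (h_ww, h_wb, h_bb)

def newton_logistic(xs, ts, iterations=25):
    """Repeat the update (5.25), solving the 2x2 system H d = grad in closed form."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        (g_w, g_b), (h_ww, h_wb, h_bb) = gradient_and_hessian(w, b, xs, ts)
        det = h_ww * h_bb - h_wb * h_wb
        w -= (h_bb * g_w - h_wb * g_b) / det
        b -= (h_ww * g_b - h_wb * g_w) / det
    return w, b

# Hypothetical overlapping data: the labels at x = 0 and x = 1 are "swapped".
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ts = [0, 0, 1, 0, 1, 1]
w, b = newton_logistic(xs, ts)
```

The per-sample weights r = p(1 − p) are what get "reweighted" at each iteration, which is where the name iteratively reweighted least squares comes from.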

Logistic regression does have some limitations, which is why it is often used for only 
the simplest classification tasks. If the classes in feature space are not linearly separable, 
using simple projections onto weight vectors may not produce adequate decision surfaces. 
In this case, kernel methods (Sections 4.1.1 and 5.1.4; Bishop 2006, Chapter 6; Murphy 
2012, Chapter 14), which measure the distances between new (test) feature vectors and select 
training examples, can often provide good solutions. 

Another problem with logistic regression is that if the classes actually are separable (either
in the original feature space, or the lifted kernel space), there can be more than a single unique
separating plane, as illustrated in Figure 5.11a. Furthermore, unless regularized, the weights
$\mathbf{w}$ will continue to grow larger, as larger values of $\mathbf{w}_k$ lead to larger
$p_{ik}$ values (once a separating plane has been found) and hence a smaller overall loss.
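The unbounded growth of the weights on separable data can be seen with a tiny experiment. This sketch, which assumes a hypothetical one-dimensional dataset split at x = 0, evaluates the loss (5.19) while scaling up a weight that already separates the classes.

```python
import math

def softplus(z):
    """log(1 + exp(z)), evaluated without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def cross_entropy(w, b, xs, ts):
    """Loss (5.19) for a scalar feature, using -log sigma(s * l) = softplus(-s * l)."""
    total = 0.0
    for x, t in zip(xs, ts):
        l = w * x + b
        sign = 1.0 if t == 1 else -1.0
        total += softplus(-sign * l)
    return total

# Hypothetical separable data: label 1 for positive x, label 0 for negative x.
xs = [-2.0, -1.0, 1.0, 2.0]
ts = [0, 0, 1, 1]
# Scaling a separating weight by 1, 10, and 100 keeps shrinking the loss.
losses = [cross_entropy(scale, 0.0, xs, ts) for scale in (1.0, 10.0, 100.0)]
```

The loss keeps decreasing toward zero without ever reaching it, so an unregularized optimizer has no finite minimizer to converge to.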

For this reason, techniques that place the decision surfaces in a way that maximizes their 


separation to labeled examples have been developed, as we discuss next. 


5.1.4 Support vector machines 


As we have just mentioned, in some applications of logistic regression we cannot determine a 
single optimal decision surface (choice of weight and bias vectors $\{\mathbf{w}_k, b_k\}$ in (5.21)) because
there are gaps in the feature space where any number of planes could be introduced. Consider 
Figure 5.11a, where the two classes are denoted in cyan and magenta colors. In addition to the 
two dashed lines and the solid line, there are infinitely many other lines that will also cleanly 
separate the two classes, including a swath of horizontal lines. Since the classification error 
for any of these lines is zero, how can we choose the best decision surface, keeping in mind 


that we only have a limited number of training examples, and that actual run-time examples 


Figure 5.11 (a) A support vector machine (SVM) finds the linear decision surface (hyperplane)
that maximizes the margin to the nearest training examples, which are called the support
vectors © Glassner (2018). (b) A two-dimensional two-class example of a Gaussian
kernel support vector machine (Bishop 2006) © 2006 Springer. The red and blue ×s indicate
the training samples, and the samples circled in green are the support vectors. The black lines
indicate iso-contours of the kernel regression function, with the contours containing the blue
and red support vectors indicating the ±1 contours and the dark contour in between being
the decision surface.


may fall somewhere in between? 


The answer to this problem is to use maximum margin classifiers (Bishop 2006, Sec- 
tion 7.1), as shown in Figure 5.11a, where the dashed lines indicate two parallel decision sur- 
faces that have the maximum margin, i.e., the largest perpendicular distance between them. 
The solid line, which represents the hyperplane half-way between the dashed hyperplanes, is 
the maximum margin classifier. 


Why is this a good idea? There are several potential derivations (Bishop 2006, Sec- 
tion 7.1), but a fairly intuitive explanation is that there may be real-world examples coming 
from the cyan and magenta classes that we have not yet seen. Under certain assumptions, the 
maximum margin classifier provides our best bet for correctly classifying as many of these 
unseen examples as possible. 


To determine the maximum margin classifier, we need to find a weight-bias pair $(\mathbf{w}, b)$
for which all regression values $l_i = \mathbf{w} \cdot \mathbf{x}_i + b$ (5.17) have an absolute
value of at least 1 as well as the correct sign. To denote this more compactly, let

$$\hat{t}_i = 2 t_i - 1, \quad \hat{t}_i \in \{-1, 1\} \tag{5.26}$$

be the signed class label. We can now re-write the inequality condition as

$$\hat{t}_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1. \tag{5.27}$$


To maximize the margin, we simply find the smallest norm weight vector $\mathbf{w}$ that
satisfies (5.27), i.e., we solve the optimization problem

$$\arg\min_{\mathbf{w}, b} \|\mathbf{w}\|^2 \tag{5.28}$$


subject to (5.27). This is a classic quadratic programming problem, which can be solved 
using the method of Lagrange multipliers, as described in Bishop (2006, Section 7.1). 
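The link between a small weight norm and a large margin follows from a short calculation, sketched here using only the quantities defined in (5.17) and (5.27):

```latex
\begin{align*}
% Perpendicular distance from a sample to the hyperplane w . x + b = 0
\operatorname{dist}(\mathbf{x}_i)
  &= \frac{|\mathbf{w} \cdot \mathbf{x}_i + b|}{\|\mathbf{w}\|}
   = \frac{|l_i|}{\|\mathbf{w}\|}, \\
% Under (5.27) the closest samples satisfy |l_i| = 1
\text{margin} = \min_i \operatorname{dist}(\mathbf{x}_i)
  &= \frac{1}{\|\mathbf{w}\|}, \\
% The two parallel surfaces l = +1 and l = -1 are twice that far apart
\text{separation between the } l = \pm 1 \text{ surfaces}
  &= \frac{2}{\|\mathbf{w}\|}.
\end{align*}
```

Maximizing the margin is therefore the same as minimizing $\|\mathbf{w}\|$, and the squared norm in (5.28) has the same minimizer while being differentiable everywhere.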


The inequality constraints are exactly satisfied, i.e., they turn into equalities, along the two
dashed lines in Figure 5.11a, where we have $l_i = \mathbf{w} \cdot \mathbf{x}_i + b = \pm 1$.
The circled points that touch the dashed lines are called the support vectors.¹¹ For a simple
linear classifier, which can be
denoted with a single weight and bias pair (w, b), there is no real advantage to computing the 
support vectors, except that they help us estimate the decision surface. However, as we will 
shortly see, when we apply kernel regression, having a small number of support vectors is a 
huge advantage. 

What happens if the two classes are not linearly separable, and in fact require a complex 
curved surface to correctly classify samples, as in Figure 5.11b? In this case, we can replace 
linear regression with kernel regression (4.3), which we introduced in Section 4.1.1. Rather 
than multiplying the weight vector w with the feature vector x, we instead multiply it with 


the value of K kernel functions centered at the data point locations $\mathbf{x}_k$,

$$l_i = f(\mathbf{x}_i; \mathbf{w}, b) = \sum_k w_k \phi(\|\mathbf{x}_i - \mathbf{x}_k\|) + b. \tag{5.29}$$


This is where the power of support vector machines truly comes in. 

Instead of requiring the summation over all training samples $\mathbf{x}_k$, once we solve for the
maximum margin classifier only a small subset of support vectors needs to be retained, as 
shown by the circled crosses in Figure 5.11b. As you can see in this figure, the decision 
boundary denoted by the dark black line nicely separates the red and blue class samples. Note 
that as with other applications of kernel regression, the width of the radial basis functions is 


still a free hyperparameter that must be reasonably tuned to avoid underfitting and overfitting. 
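The practical benefit of sparsity can be illustrated by evaluating (5.29) twice: once over every training location and once over the support vectors only. In this sketch the Gaussian form of the radial basis function, the width `sigma`, and the weights are all hypothetical; in particular, the weights are hand-picked and not the output of an SVM solver.

```python
import math

def rbf(r, sigma):
    """Gaussian radial basis function phi(r) = exp(-r^2 / (2 sigma^2))."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))

def kernel_value(x, centers, weights, b, sigma):
    """Equation (5.29): l = sum_k w_k phi(||x - x_k||) + b."""
    return sum(w * rbf(math.dist(x, c), sigma) for c, w in zip(centers, weights)) + b

# Hypothetical training locations and weights. The zeros stand in for samples
# that did not end up as support vectors.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
weights = [0.0, 0.8, 0.0, -0.8, 0.0, 0.0]
b, sigma = 0.0, 1.5

# Keep only the samples with non-zero weight (the support vectors).
support = [(c, w) for c, w in zip(centers, weights) if w != 0.0]
sv_centers = [c for c, _ in support]
sv_weights = [w for _, w in support]

query = (0.5, 0.5)
full = kernel_value(query, centers, weights, b, sigma)
pruned = kernel_value(query, sv_centers, sv_weights, b, sigma)
```

Both evaluations give the same value, but the pruned one touches two kernel centers instead of six, and the saving grows with the size of the training set.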


¹¹While the cyan and magenta dots may just look like points, they are, of course, schematic representations of
higher-dimensional vectors lying in feature space.




Figure 5.12 Support vector machine for overlapping class distributions (Bishop 2006) ©
2006 Springer. (a) The green circled point is on the wrong side of the $y = 1$ decision contour
and has a penalty of $\xi = 1 - y > 0$. (b) The “hinge” loss used in support vector machines
is shown in blue, along with a rescaled version of the logistic regression loss function, shown
in red, the misclassification error in black, and the squared error in green.


Hinge loss. So far, we have focused on classification problems that are separable, i.e., for
which a decision boundary exists that correctly classifies all the training examples. Support
vector machines can also be applied to overlapping (mixed) class distributions (Figure 5.12a),
which we previously approached using logistic regression. In this case, we replace the
inequality conditions (5.27), i.e., $\hat{t}_i l_i \geq 1$, with a hinge loss penalty

$$E_{\mathrm{HL}}(l_i, \hat{t}_i) = [1 - \hat{t}_i l_i]_+, \tag{5.30}$$

where $[\cdot]_+$ denotes the positive part, i.e., $[x]_+ = \max(0, x)$. The hinge loss penalty,
shown in blue in Figure 5.12b, is 0 whenever the (previous) inequality is satisfied and ramps up
linearly depending on how much the inequality is violated. To find the optimal weight values
$(\mathbf{w}, b)$, we minimize the regularized sum of hinge loss values,

$$E_{\mathrm{SV}}(\mathbf{w}, b) = \sum_i E_{\mathrm{HL}}(l_i(\mathbf{x}_i; \mathbf{w}, b), \hat{t}_i) + \lambda \|\mathbf{w}\|^2. \tag{5.31}$$
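One simple, if not the most efficient, way to minimize (5.31) for a linear classifier is subgradient descent. The sketch below is only an illustration under assumed settings: the regularization weight, step size, iteration count, and toy data are arbitrary choices, and production solvers typically rely on more sophisticated optimizers.

```python
def hinge(l, t_hat):
    """Equation (5.30): [1 - t_hat * l]_+ for a signed label t_hat in {-1, +1}."""
    return max(0.0, 1.0 - t_hat * l)

def linear_value(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def svm_objective(w, b, xs, t_hats, lam):
    """Equation (5.31): summed hinge loss plus lambda * ||w||^2."""
    data = sum(hinge(linear_value(w, b, x), t) for x, t in zip(xs, t_hats))
    return data + lam * sum(wj * wj for wj in w)

def train_linear_svm(xs, t_hats, lam=0.01, step=0.1, iterations=200):
    """Plain subgradient descent on (5.31) with a fixed step size."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(iterations):
        g_w = [2.0 * lam * wj for wj in w]    # gradient of the regularizer
        g_b = 0.0
        for x, t in zip(xs, t_hats):
            if t * linear_value(w, b, x) < 1.0:   # margin violated: hinge is active
                g_w = [g - t * xj for g, xj in zip(g_w, x)]
                g_b -= t
        w = [wj - step * g for wj, g in zip(w, g_w)]
        b -= step * g_b
    return w, b

# Hypothetical, well-separated 2D training set with signed labels.
xs = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0), (-2.0, -1.0), (-1.0, -2.5), (-3.0, -2.0)]
t_hats = [1, 1, 1, -1, -1, -1]
lam = 0.01
w, b = train_linear_svm(xs, t_hats, lam=lam)
```

Samples that already satisfy the margin contribute nothing to the subgradient, which mirrors the observation that only points on or inside the margin influence the solution.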


Figure 5.12b compares the hinge loss to the logistic regression (cross-entropy) loss in 
(5.19). The hinge loss imposes no penalty on training samples that are on the correct side of 
the $|l_i| \geq 1$ boundary, whereas the cross-entropy loss prefers larger absolute values. While,
in this section, we have focused on the two-class version of support vector machines, Bishop 
(2006, Chapter 7) describes the extension to multiple classes as well as efficient optimization 


algorithms such as sequential minimal optimization (SMO) (Platt 1989). There’s also a nice
online tutorial on the scikit-learn website.¹² A survey of SVMs and other kernel methods
applied to computer vision can be found in Lampert (2008).


5.1.5 Decision trees and forests 


In contrast to most of the supervised learning techniques we have studied so far in this chapter, 
which process complete feature vectors all at once (with either linear projections or distances 
to training examples), decision trees perform a sequence of simpler operations, often just 
looking at individual feature elements before deciding which element to look at next (Hastie, 
Tibshirani, and Friedman 2009, Chapter 17; Glassner 2018, Section 14.5; Criminisi, Shot- 
ton, and Konukoglu 2012; Criminisi and Shotton 2013). (Note that the boosting approaches 
we study in Section 6.3.1 also use similar simple decision stumps.) While decision trees 
have been used in statistical machine learning for several decades (Breiman, Friedman et al. 
1984), the application of their more powerful extension, namely decision forests, only started 
gaining traction in computer vision a little over a decade ago (Lepetit and Fua 2006; Shotton, 
Johnson, and Cipolla 2008; Shotton, Girshick et al. 2013). Decision trees, like support vec- 
tor machines, are discriminative classifiers (or regressors), since they never explicitly form a 
probabilistic (generative) model of the data they are classifying. 

Figure 5.13 illustrates the basic concepts behind decision trees and random forests. In this 
example, training samples come from four different classes, each shown in a different color 
(a). A decision tree (b) is constructed top-to-bottom by selecting decisions at each node that 
split the training samples that have made it to that node into more specific (lower entropy) 
distributions. The thickness of each link shows the number of samples that get classified 
along that path, and the color of the link is the blend of the class colors that flow through that 
link. The color histograms show the class distributions at a few of the interior nodes. 
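As a rough illustration of how a single node might choose its decision, the following sketch searches all axis-aligned thresholds and keeps the one with the largest entropy reduction (information gain). The exhaustive search, the function names, and the toy samples are assumptions; the cited papers describe many variations.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_axis_aligned_split(samples, labels):
    """Return (gain, feature index, threshold) maximizing the information gain."""
    parent = entropy(labels)
    best = (0.0, None, None)
    for d in range(len(samples[0])):
        values = sorted(set(s[d] for s in samples))
        # Candidate thresholds lie halfway between consecutive feature values.
        for lo, hi in zip(values, values[1:]):
            thr = 0.5 * (lo + hi)
            left = [t for s, t in zip(samples, labels) if s[d] < thr]
            right = [t for s, t in zip(samples, labels) if s[d] >= thr]
            child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = parent - child
            if gain > best[0]:
                best = (gain, d, thr)
    return best

# Hypothetical 2D samples: only the first feature element separates the classes.
samples = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 5.0), (8.0, 1.0)]
labels = ["a", "a", "a", "b", "b", "b"]
gain, feature, threshold = best_axis_aligned_split(samples, labels)
```

A full tree would apply the same search recursively to the samples reaching each child until a depth limit or a pure node is reached.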

A random forest (c) is created by building a set of decision trees, each of which makes 
slightly different decisions. At test (classification) time, a new sample is classified by each of 
the trees in the random forest, and the class distributions at the final leaf nodes are averaged 
to provide an answer that is more accurate than could be obtained with a single tree (with a 
given depth). 

Random forests have several design parameters, which can be used to tailor their accuracy, 
generalization, and run-time and space complexity. These parameters include: 


• the depth of each tree D,

• the number of trees T, and


¹²https://scikit-learn.org/stable/modules/svm.html#svm-classification
Figure 5.13 Decision trees and forests (Criminisi and Shotton 2013) © 2013 Springer. The
top left figure (a) shows a set of training samples tagged with four different class colors. The
top right (b) shows a single decision tree with a distribution of classes at each node (the 
root node has the same distribution as the entire training set). During testing (c), each new 
example (feature vector) is tested at the root node, and depending on this test result (e.g., the 
comparison of some element to a threshold), a decision is made to walk down the tree to one 
of its children. This continues until a leaf node with a particular class distribution is reached. 
During training (b), decisions are selected such that they reduce the entropy (increase class 
specificity) at the node’s children. The bottom diagram (c) shows an ensemble of three trees. 
After a particular test example has been classified by each tree, the class distributions of the 


leaf nodes of all the constituent trees are averaged. 
Figure 5.14 Random forest decision surfaces (Criminisi and Shotton 2013) © 2013
Springer. Figures (a) and (b) show smaller and larger amounts of “noise” between the
T = 400 tree forests obtained by using p = 500 and p = 5 random hypotheses at each
split node. Within each figure, the two rows show trees of different depths (D = 5 and 13),
while the columns show the effects of using axis-aligned or linear decision surfaces (“weak
learners”).


• the number of samples examined at node construction time p.


By only looking at a random subset p of all the training examples, each tree ends up having 
different decision functions at each node, so that the ensemble of trees can be averaged to 
produce softer decision boundaries. 
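The averaging behavior can be sketched with a deliberately tiny forest. The code below makes several simplifying assumptions that are not part of the description above: features are scalars, every tree has depth 1 (a decision stump), each stump sees its own random subset of `rho` training samples, and all names and data are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def histogram(labels, num_classes):
    counts = Counter(labels)
    return [counts[k] / len(labels) for k in range(num_classes)]

def train_stump(xs, ts, num_classes):
    """Depth-1 tree on scalar features: pick the threshold with the lowest child entropy."""
    best_thr, best_cost = None, entropy(ts)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = 0.5 * (lo + hi)
        left = [t for x, t in zip(xs, ts) if x < thr]
        right = [t for x, t in zip(xs, ts) if x >= thr]
        cost = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ts)
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    if best_thr is None:   # no split lowers the entropy: the stump is a single leaf
        leaf = histogram(ts, num_classes)
        return None, leaf, leaf
    left = [t for x, t in zip(xs, ts) if x < best_thr]
    right = [t for x, t in zip(xs, ts) if x >= best_thr]
    return best_thr, histogram(left, num_classes), histogram(right, num_classes)

def stump_predict(stump, x):
    thr, left, right = stump
    return left if thr is None or x < thr else right

def train_forest(xs, ts, num_classes, num_trees, rho, seed=0):
    """Each tree is built from its own random subset of rho training samples."""
    rng = random.Random(seed)
    forest = []
    for _ in range(num_trees):
        idx = rng.sample(range(len(xs)), rho)
        forest.append(train_stump([xs[i] for i in idx], [ts[i] for i in idx], num_classes))
    return forest

def forest_predict(forest, x):
    """Average the leaf class distributions reached in every tree."""
    dists = [stump_predict(s, x) for s in forest]
    return [sum(d[k] for d in dists) / len(dists) for k in range(len(dists[0]))]

# Hypothetical 1D data: class 0 below x = 5, class 1 from x = 5 upward.
xs = [float(i) for i in range(10)]
ts = [0] * 5 + [1] * 5
forest = train_forest(xs, ts, num_classes=2, num_trees=50, rho=4)
p_low, p_high = forest_predict(forest, 0.0), forest_predict(forest, 9.0)
```

Because each stump places its threshold slightly differently, the averaged distribution changes gradually for queries near the class boundary instead of switching abruptly, which is the softening effect mentioned above.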

Figure 5.14 shows the effects of some of these parameters on a simple four-class two- 
dimensional spiral dataset. In this figure, the number of trees has been fixed to T = 400. 
Criminisi and Shotton (2013, Chapter 4) have additional figures showing the effect of varying 
more parameters. The left (a) and right (b) halves of this figure show the effects of having 
less randomness (p = 500) and more randomness (p = 5) at the decision nodes. Less random 
trees produce sharper decision surfaces but may not generalize as well. Within each 2 x 2 
grid of images, the top row shows a shallower D = 5 tree, while the bottom row shows 
a deeper D = 13 tree, which leads to finer details in the decision boundary. (As with all 
machine learning, better performance on training data may not lead to better generalization 
because of overfitting.) Finally, the right column shows what happens if axis-aligned (single 


element) decisions are replaced with linear combinations of feature elements. 
When applied to computer vision, decision trees first made an impact in keypoint recognition
(Lepetit and Fua 2006) and image segmentation (Shotton, Johnson, and Cipolla 2008).
They were one of the key ingredients (along with massive amounts of synthetic training data) 
in the breakthrough success of human pose estimation from Kinect depth images (Shotton, 
Girshick et al. 2013). They also led to state-of-the-art medical image segmentation systems 
(Criminisi, Robertson et al. 2013), although these have now been supplanted by deep neural 
networks (Kamnitsas, Ferrante et al. 2016). Most of these applications, along with additional 
ones, are reviewed in the book edited by Criminisi and Shotton (2013). 


5.2 Unsupervised learning 


Thus far in this chapter, we have focused on supervised learning techniques where we are 
given training data consisting of paired input and target examples. In some applications, 
however, we are only given a set of data, which we wish to characterize, e.g., to see if there 
are any patterns, regularities, or typical distributions. This is typically the realm of classical 
statistics. In the machine learning community, this scenario is usually called unsupervised 
learning, since the sample data comes without labels. Examples of applications in computer 
vision include image segmentation (Section 7.5) and face and body recognition and
reconstruction (Section 13.6.2).

In this section, we look at some of the more widely used techniques in computer vision, 
namely clustering and mixture modeling (e.g., for segmentation) and principal component 
analysis (for appearance and shape modeling). Many other techniques are available, and 
are covered in textbooks on machine learning, such as Bishop (2006, Chapter 9), Hastie, 
Tibshirani, and Friedman (2009, Chapter 14), and Murphy (2012, Section 1.3). 


5.2.1 Clustering 


One of the simplest things you can do with your sample data is to group it into sets based on 
similarities (e.g., vector distances). In statistics, this problem is known as cluster analysis and 
is a widely studied area with hundreds of different algorithms (Jain and Dubes 1988; Kaufman 
and Rousseeuw 1990; Jain, Duin, and Mao 2000; Jain, Topchy et al. 2004). Murphy (2012, 
Chapter 25) has a nice exposition on clustering algorithms, including affinity propagation, 
spectral clustering, graph Laplacian, hierarchical, agglomerative, and divisive clustering. The 
survey by Xu and Wunsch (2005) is even more comprehensive, covering almost 300 different 
papers and such topics as similarity measures, vector quantization, mixture modeling, kernel 


methods, combinatorial and neural network algorithms, and visualization. Figure 5.15 shows 
Figure 5.15 Comparison of different clustering algorithms (MiniBatchKMeans,
AffinityPropagation, MeanShift, SpectralClustering, AgglomerativeClustering, and
GaussianMixture) on some toy datasets, generated using a simplified version of
https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py.


some of the algorithms implemented in the https://scikit-learn.org cluster analysis package 
applied to some simple two-dimensional examples. 

Splitting an image into successively finer regions (divisive clustering) is one of the oldest 
techniques in computer vision. Ohlander, Price, and Reddy (1978) present such a technique, 
which first computes a histogram for the whole image and then finds a threshold that best sep- 
arates the large peaks in the histogram. This process is repeated until regions are either fairly 
uniform or below a certain size. More recent splitting algorithms often optimize some metric 
of intra-region similarity and inter-region dissimilarity. These are covered in Sections 7.5.3 
and 4.3.2. 

Region merging techniques also date back to the beginnings of computer vision. Brice 
and Fennema (1970) use a dual grid for representing boundaries between pixels and merge 
regions based on their relative boundary lengths and the strength of the visible edges at these
boundaries. 

In data clustering, algorithms can link clusters together based on the distance between 
their closest points (single-link clustering), their farthest points (complete-link clustering), 
or something in between (Jain, Topchy et al. 2004). Kamvar, Klein, and Manning (2002)
provide a probabilistic interpretation of these algorithms and show how additional models 
can be incorporated within this framework. Applications of such agglomerative clustering 
(region merging) algorithms to image segmentation are discussed in Section 7.5. 
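The two extreme linkage rules differ only in which pair of points defines the distance between clusters. A brute-force sketch, assuming Euclidean distances and a hypothetical set of 2D points, is:

```python
import math
from itertools import product

def single_link(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

def agglomerate(points, k, linkage):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)      # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters

# Hypothetical points: two tight pairs and one distant singleton.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (11.0, 0.0)]
clusters = agglomerate(points, 2, single_link)
```

This exhaustive version re-evaluates every cluster pair at every merge, so it is only suitable for small point sets; practical implementations cache and update the pairwise distances.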

Mean-shift (Section 7.5.2) and mode finding techniques, such as k-means and mixtures of 
Gaussians, model the feature vectors associated with each pixel (e.g., color and position) as 
samples from an unknown probability density function and then try to find clusters (modes) 
in this distribution. 

Consider the color image shown in Figure 7.53a. How would you segment this image 
based on color alone? Figure 7.53b shows the distribution of pixels in L*u*v* space, which 
is equivalent to what a vision algorithm that ignores spatial location would see. To make the 
visualization simpler, let us only consider the L*u* coordinates, as shown in Figure 7.53c. 
How many obvious (elongated) clusters do you see? How would you go about finding these 
clusters? 

The k-means and mixtures of Gaussians techniques use a parametric model of the den- 
sity function to answer this question, i.e., they assume the density is the superposition of a 
small number of simpler distributions (e.g., Gaussians) whose locations (centers) and shape 
(covariance) can be estimated. Mean shift, on the other hand, smoothes the distribution and 
finds its peaks as well as the regions of feature space that correspond to each peak. Since a 


complete density is being modeled, this approach is called non-parametric (Bishop 2006). 


5.2.2 K-means and Gaussians mixture models 


K-means implicitly models the probability density as a superposition of spherically symmetric
distributions and does not require any probabilistic reasoning or modeling (Bishop 2006).
Instead, the algorithm is given the number of clusters k it is supposed to find and is initialized
by randomly sampling k centers from the input feature vectors. It then iteratively updates
the cluster center location based on the samples that are closest to each center (Figure 5.16). 
Techniques have also been developed for splitting or merging cluster centers based on their 
statistics, and for accelerating the process of finding the nearest mean center (Bishop 2006). 
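The basic k-means iteration can be written in a few lines. In the sketch below, the optional `init` argument, the stopping rule (assignments no longer change), and the handling of empty clusters are assumptions added to keep the example deterministic and robust; by default the centers are sampled at random from the inputs as described above.

```python
import math
import random

def kmeans(points, k, init=None, iterations=100, seed=0):
    """Basic k-means; centers default to k randomly sampled input vectors."""
    if init is not None:
        centers = list(init)
    else:
        centers = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every sample.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break                     # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its assigned samples.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Hypothetical data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, labels = kmeans(points, k=2, init=[points[0], points[-1]])
```

Explicit initial centers are used here because k-means only finds a local optimum: an unlucky random initialization can converge to a poor clustering, even on data as simple as this, so several restarts are commonly used in practice.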

Figure 5.16 The k-means algorithm starts with a set of samples and the number of desired
clusters (in this case, k = 2) (Bishop 2006) © 2006 Springer. It iteratively assigns samples
to the nearest mean, and then re-computes the mean center until convergence.

Figure 5.17 Gaussian mixture modeling (GMM) using expectation maximization (EM)
(Bishop 2006) © 2006 Springer. Samples are softly assigned to cluster centers based on
their Mahalanobis distance (inverse covariance weighted distance), and the new means and
covariances are recomputed based on these weighted assignments.

In mixtures of Gaussians, each cluster center is augmented by a covariance matrix whose
values are re-estimated from the corresponding samples (Figure 5.17). Instead of using
nearest neighbors to associate input samples with cluster centers, a Mahalanobis distance
(Appendix B.1) is used:

$$d(\mathbf{x}_i, \boldsymbol{\mu}_k; \boldsymbol{\Sigma}_k) = \|\mathbf{x}_i - \boldsymbol{\mu}_k\|_{\boldsymbol{\Sigma}_k^{-1}} = (\mathbf{x}_i - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) \tag{5.32}$$


where $\mathbf{x}_i$ are the input samples, $\boldsymbol{\mu}_k$ are the cluster centers, and
$\boldsymbol{\Sigma}_k$ are their covariance estimates. Samples can be associated with the
nearest cluster center (a hard assignment of membership) or can be softly assigned to several
nearby clusters.
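To see how the covariance changes the notion of "nearest", the following sketch evaluates the quadratic form in (5.32) for a 2 × 2 covariance, using the closed-form matrix inverse. The covariance and the two query points are made-up values.

```python
def mahalanobis_quadratic(x, mu, cov):
    """(x - mu)^T cov^{-1} (x - mu) for a 2x2 covariance matrix, as in (5.32)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

# Hypothetical cluster: elongated along the first axis (variance 4 versus 1).
mu = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))
along = mahalanobis_quadratic((2.0, 0.0), mu, cov)    # along the elongated axis
across = mahalanobis_quadratic((0.0, 2.0), mu, cov)   # across it
```

Both query points are at Euclidean distance 2 from the center, yet the one lying along the elongated direction is considered four times closer in this measure.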

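This latter, more commonly used, approach corresponds to iteratively re-estimating the
parameters for a Gaussian mixture model,

$$p(\mathbf{x} | \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}) = \sum_k \pi_k \, \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \tag{5.33}$$

where $\pi_k$ are the mixing coefficients, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are
the Gaussian means and covariances, and

$$\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left( -\tfrac{1}{2} \, d(\mathbf{x}, \boldsymbol{\mu}_k; \boldsymbol{\Sigma}_k) \right) \tag{5.34}$$

is the normal (Gaussian) distribution for $D$-dimensional samples, with $d$ the quadratic form
defined in (5.32) (Bishop 2006).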

To iteratively compute (a local) maximum likelihood estimate for the unknown mixture
parameters $\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$, the expectation maximization
(EM) algorithm (Shlezinger 1968; Dempster, Laird, and Rubin 1977) proceeds in two alternating
stages:


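1. The expectation stage (E step) estimates the responsibilities

$$z_{ik} = \frac{1}{Z_i} \pi_k \, \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad \text{with} \quad \sum_k z_{ik} = 1, \tag{5.35}$$

which are the estimates of how likely a sample $\mathbf{x}_i$ was generated from the $k$th
Gaussian cluster.

2. The maximization stage (M step) updates the parameter values

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_i z_{ik} \mathbf{x}_i, \tag{5.36}$$

$$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_i z_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T, \tag{5.37}$$

$$\pi_k = \frac{N_k}{N}, \tag{5.38}$$

where

$$N_k = \sum_i z_{ik} \tag{5.39}$$

is an estimate of the number of sample points assigned to each cluster.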
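The two stages can be sketched for the simplest case of scalar samples, where each covariance reduces to a variance. The one-dimensional restriction, the properly normalized scalar Gaussian density, the variance floor `min_var`, the initial values, and the fixed iteration count are all assumptions of this sketch.

```python
import math

def normal_pdf(x, mu, var):
    """Scalar Gaussian density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, mus, variances, pis, iterations=50, min_var=1e-6):
    """Alternate the E step (5.35) and the M step (5.36)-(5.39) for scalar samples."""
    mus, variances, pis = list(mus), list(variances), list(pis)
    K, N = len(mus), len(xs)
    for _ in range(iterations):
        # E step: responsibilities z[i][k], normalized to sum to 1 over k.
        z = []
        for x in xs:
            weighted = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            Z = sum(weighted)
            z.append([wk / Z for wk in weighted])
        # M step: re-estimate means, variances, and mixing coefficients.
        for k in range(K):
            Nk = sum(z[i][k] for i in range(N))
            mus[k] = sum(z[i][k] * xs[i] for i in range(N)) / Nk
            var = sum(z[i][k] * (xs[i] - mus[k]) ** 2 for i in range(N)) / Nk
            variances[k] = max(var, min_var)   # floor guards against collapse
            pis[k] = Nk / N
    return mus, variances, pis

# Hypothetical samples drawn by hand around -2 and +2.
xs = [-2.2, -2.0, -1.8, -2.1, -1.9, 1.9, 2.0, 2.1, 1.8, 2.2]
mus, variances, pis = em_gmm_1d(xs, mus=[-1.0, 1.0], variances=[1.0, 1.0], pis=[0.5, 0.5])
```

Replacing the soft responsibilities with a hard 0/1 assignment to the most likely cluster, and fixing all variances to be equal, recovers the k-means update as a special case.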
Bishop (2006) has a wonderful exposition of both mixture of Gaussians estimation and the
more general topic of expectation maximization. 

In the context of image segmentation, Ma, Derksen et al. (2007) present a nice review of 
segmentation using mixtures of Gaussians and develop their own extension based on Min- 
imum Description Length (MDL) coding, which they show produces good results on the 


Berkeley segmentation dataset. 


5.2.3 Principal component analysis 


As we just saw in mixture analysis, modeling the samples within a cluster with a multi- 
variate Gaussian can be a powerful way to capture their distribution. Unfortunately, as the 
dimensionality of our sample space increases, estimating the full covariance quickly becomes 
infeasible. 

Consider, for example, the space of all frontal faces (Figure 5.18). For an image consisting 
of P pixels, the covariance matrix has a size of P x P. Fortunately, the full covariance 
normally does not have to be modeled, since a lower-rank approximation can be estimated 
using principal component analysis, as described in Appendix A.1.2. 

PCA was originally used in computer vision for modeling faces, i.e., eigenfaces, initially 
for gray-scale images (Kirby and Sirovich 1990; Turk and Pentland 1991), and then for 3D 
models (Blanz and Vetter 1999; Egger, Smith et al. 2020) (Section 13.6.2) and active appear- 
ance models (Section 6.2.4), where they were also used to model facial shape deformations 
(Rowland and Perrett 1995; Cootes, Edwards, and Taylor 2001; Matthews, Xiao, and Baker 
2007). 


Eigenfaces. Eigenfaces rely on the observation first made by Kirby and Sirovich (1990) 
that an arbitrary face image x can be compressed and reconstructed by starting with a mean 


image $\mathbf{m}$ (Figure 6.1b) and adding a small number of scaled signed images $\mathbf{u}_i$,

$$\tilde{\mathbf{x}} = \mathbf{m} + \sum_{i=0}^{M-1} a_i \mathbf{u}_i, \tag{5.40}$$


where the signed basis images (Figure 5.18b) can be derived from an ensemble of training
images using principal component analysis (also known as eigenvalue analysis or the
Karhunen–Loève transform). Turk and Pentland (1991) recognized that the coefficients $a_i$ in
the eigenface expansion could themselves be used to construct a fast image matching algorithm.


Figure 5.18 Face modeling and compression using eigenfaces (Moghaddam and Pentland
1997) © 1997 IEEE: (a) input image; (b) the first eight eigenfaces; (c) image reconstructed
by projecting onto this basis and compressing the image to 85 bytes; (d) image reconstructed
using JPEG (530 bytes).


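In more detail, we start with a collection of training images $\{\mathbf{x}_j\}$, from which we
compute the mean image $\mathbf{m}$ and a scatter or covariance matrix

$$\mathbf{C} = \frac{1}{N} \sum_{j=0}^{N-1} (\mathbf{x}_j - \mathbf{m})(\mathbf{x}_j - \mathbf{m})^T. \tag{5.41}$$

We can apply the eigenvalue decomposition (A.6) to represent this matrix as

$$\mathbf{C} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T = \sum_{i=0}^{N-1} \lambda_i \mathbf{u}_i \mathbf{u}_i^T, \tag{5.42}$$

where the $\lambda_i$ are the eigenvalues of $\mathbf{C}$ and the $\mathbf{u}_i$ are the
eigenvectors. For general images, Kirby and Sirovich (1990) call these vectors eigenpictures;
for faces, Turk and Pentland (1991) call them eigenfaces (Figure 5.18b).¹³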
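Since the number of pixels P is usually far larger than the number of training images N, the eigenvectors are typically obtained from an N × N matrix of inner products rather than from C itself. A brief sketch of why this works, where A is a hypothetical name for the P × N matrix whose columns are the deviations from the mean:

```latex
\begin{align*}
% Stack the deviations as columns, so that (5.41) becomes an outer product
\mathbf{A} &= [\, \mathbf{x}_0 - \mathbf{m}, \; \ldots, \; \mathbf{x}_{N-1} - \mathbf{m} \,],
  \qquad \mathbf{C} = \tfrac{1}{N} \mathbf{A} \mathbf{A}^T, \\
% Suppose v is an eigenvector of the small N x N matrix with eigenvalue lambda > 0
\tfrac{1}{N} \mathbf{A}^T \mathbf{A} \, \mathbf{v} &= \lambda \mathbf{v}, \\
% Multiplying on the left by A shows that A v is an eigenvector of C
\mathbf{C} (\mathbf{A} \mathbf{v})
  &= \tfrac{1}{N} \mathbf{A} \mathbf{A}^T \mathbf{A} \mathbf{v}
   = \mathbf{A} (\lambda \mathbf{v})
   = \lambda (\mathbf{A} \mathbf{v}), \\
% Normalizing gives a unit-norm eigenface
\mathbf{u} &= \frac{\mathbf{A} \mathbf{v}}{\|\mathbf{A} \mathbf{v}\|},
  \qquad \|\mathbf{A} \mathbf{v}\|^2 = \mathbf{v}^T \mathbf{A}^T \mathbf{A} \mathbf{v} = N \lambda \|\mathbf{v}\|^2 .
\end{align*}
```

The non-zero eigenvalues of the two matrices therefore coincide, and each eigenface is a linear combination of the training deviations.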

Two important properties of the eigenvalue decomposition are that the optimal (best
approximation) coefficients $a_i$ for any new image $\mathbf{x}$ can be computed as

$$a_i = (\mathbf{x} - \mathbf{m}) \cdot \mathbf{u}_i, \tag{5.43}$$

and that, assuming the eigenvalues $\{\lambda_i\}$ are sorted in decreasing order, truncating the
approximation given in (5.40) at any point M gives the best possible approximation (least
error) between $\tilde{\mathbf{x}}$ and $\mathbf{x}$. Figure 5.18c shows the resulting
approximation corresponding to Figure 5.18a and shows how much better it is at compressing a
face image than JPEG.
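The projection (5.43) and the truncated reconstruction (5.40) amount to a handful of dot products. In the sketch below, the four-pixel "image", the mean, and the orthonormal basis are hypothetical values chosen so the arithmetic is easy to follow; they are not the result of an actual eigenvalue decomposition.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def eigen_coefficients(x, mean, basis):
    """Equation (5.43): a_i = (x - m) . u_i for each basis image u_i."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [dot(centered, u) for u in basis]

def reconstruct(coeffs, mean, basis, M):
    """Equation (5.40): x_tilde = m + sum of the first M scaled basis images."""
    x_tilde = list(mean)
    for a, u in zip(coeffs[:M], basis[:M]):
        x_tilde = [xt + a * ui for xt, ui in zip(x_tilde, u)]
    return x_tilde

# Hypothetical 4-pixel example with three orthonormal basis images.
mean = (1.0, 1.0, 1.0, 1.0)
basis = [(0.5, 0.5, 0.5, 0.5), (0.5, 0.5, -0.5, -0.5), (0.5, -0.5, 0.5, -0.5)]
x = (5.0, 2.0, 2.0, 0.0)

coeffs = eigen_coefficients(x, mean, basis)
# Reconstruction error after keeping M = 1, 2, 3 components.
errors = [math.dist(x, reconstruct(coeffs, mean, basis, M)) for M in (1, 2, 3)]
```

Only the M coefficients need to be stored or transmitted to recover the approximation, which is what makes the representation useful for compression and matching.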
Truncating the eigenface decomposition of a face image (5.40) after M components is 
equivalent to projecting the image onto a linear subspace $F$, which we can call the face space


¹³In actual practice, the full P × P scatter matrix (5.41) is never computed. Instead, a smaller N × N matrix
consisting of the inner products between all the signed deviations $(\mathbf{x}_i - \mathbf{m})$ is accumulated.
See Appendix A.1.2 (A.13–A.14) for details.
Figure 5.19 Projection onto the linear subspace spanned by the eigenface images
(Moghaddam and Pentland 1997) © 1997 IEEE. The distance from face space (DFFS) is
the orthogonal distance to the plane, while the distance in face space (DIFS) is the distance
along the plane from the mean image. Both distances can be turned into Mahalanobis
distances and given probabilistic interpretations.


(Figure 5.19). Because the eigenvectors (eigenfaces) are orthogonal and of unit norm, the
distance of a projected face $\tilde{\mathbf{x}}$ to the mean face $\mathbf{m}$ can be written as

$$\mathrm{DIFS} = \|\tilde{\mathbf{x}} - \mathbf{m}\| = \sqrt{\sum_{i=0}^{M-1} a_i^2}, \tag{5.44}$$

where DIFS stands for distance in face space (Moghaddam and Pentland 1997). The remaining
distance between the original image $\mathbf{x}$ and its projection onto face space
$\tilde{\mathbf{x}}$, i.e., the distance from face space (DFFS), can be computed directly in
pixel space and represents the “faceness” of a particular image. It is also possible to measure
the distance between two different faces in face space by taking the norm of their eigenface
coefficients difference.
Computing such distances in Euclidean vector space, however, does not exploit the ad- 
ditional information that the eigenvalue decomposition of the covariance matrix (5.42) pro- 
vides. To properly weight the distance based on the measured covariance, we can use the 
Mahalanobis distance (5.32) (Appendix B.1). A similar analysis can be performed for
computing a sensible distance from face space (DFFS) (Moghaddam and Pentland 1997) and
the two terms can be combined to produce an estimate of the likelihood of being a true face, 
which can be useful in doing face detection (Section 6.3.1). More detailed explanations of 
probabilistic and Bayesian PCA can be found in textbooks on statistical learning (Bishop 
2006; Hastie, Tibshirani, and Friedman 2009; Murphy 2012), which also discuss techniques 


for selecting the optimum number of components M to use in modeling a distribution. 
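The distances discussed above can be computed directly from the eigenface coefficients. The following sketch assumes a hypothetical orthonormal basis of M = 2 images with made-up eigenvalues; the covariance-weighted distance in face space uses the fact that, in the eigenvector basis, the Mahalanobis quadratic form reduces to a sum of squared coefficients divided by their eigenvalues.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def face_space_distances(x, mean, basis, eigenvalues):
    """Return (DIFS, DFFS, covariance-weighted DIFS) for an orthonormal basis."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    coeffs = [dot(centered, u) for u in basis]           # equation (5.43)
    difs = math.sqrt(sum(a * a for a in coeffs))         # equation (5.44)
    # Residual orthogonal to face space: remove the projection from x - m.
    residual = list(centered)
    for a, u in zip(coeffs, basis):
        residual = [r - a * ui for r, ui in zip(residual, u)]
    dffs = math.sqrt(sum(r * r for r in residual))
    # Weight each squared coefficient by the inverse of its eigenvalue.
    weighted = math.sqrt(sum(a * a / lam for a, lam in zip(coeffs, eigenvalues)))
    return difs, dffs, weighted

# Hypothetical 4-pixel example with M = 2 basis images and their eigenvalues.
mean = (1.0, 1.0, 1.0, 1.0)
basis = [(0.5, 0.5, 0.5, 0.5), (0.5, 0.5, -0.5, -0.5)]
eigenvalues = [4.0, 1.0]
x = (5.0, 2.0, 2.0, 0.0)
difs, dffs, weighted = face_space_distances(x, mean, basis, eigenvalues)
```

Because the residual is orthogonal to face space, DIFS and DFFS satisfy the Pythagorean relation with the total distance from the mean image, which is a convenient check on an implementation.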
The original work on eigenfaces for recognition (Turk and Pentland 1991) was extended
in Moghaddam and Pentland (1997), Heisele, Ho et al. (2003), and Heisele, Serre, and Poggio
(2007) to include modular eigenspaces for separately modeling the appearance of different
facial components such as the eyes, nose, and mouth, as well as view-based eigenspaces to 
separately model different views of a face. It was also extended by Belhumeur, Hespanha, 
and Kriegman (1997) to handle appearance variation due to illumination, modeling intraper- 
sonal and extrapersonal variability separately, and using Fisher linear discriminant analysis 
(Figure 5.10) to perform recognition. A Bayesian extension of this work was subsequently 
developed by Moghaddam, Jebara, and Pentland (2000). These extensions are described in 
more detail in the cited papers, as well as the first edition of this book (Szeliski 2010, Sec- 
tion 14.2). 

It is also possible to generalize the bilinear factorization implicit in PCA and SVD ap- 
proaches to multilinear (tensor) formulations that can model several interacting factors si- 
multaneously (Vasilescu and Terzopoulos 2007). These ideas are related to additional topics 
in machine learning such as subspace learning (Cai, He et al. 2007), local distance functions 
(Frome, Singer et al. 2007; Ramanan and Baker 2009), and metric learning (Kulis 2013). 


5.2.4 Manifold learning 


In many cases, the data we are analyzing does not reside in a globally linear subspace, but 
does live on a lower-dimensional manifold. In this case, non-linear dimensionality reduction 
can be used (Lee and Verleysen 2007). Since these systems extract lower-dimensional man- 
ifolds in a higher-dimensional space, they are also known as manifold learning techniques 
(Zheng and Xue 2009). Figure 5.20 shows some examples of two-dimensional manifolds extracted
from the three-dimensional S-shaped ribbon using the scikit-learn manifold learning
package.¹⁴

These results are just a small sample from the large number of algorithms that have been 
developed, which include multidimensional scaling (Kruskal 1964a,b), Isomap (Tenenbaum, 
De Silva, and Langford 2000), Local Linear Embedding (Roweis and Saul 2000), Hessian 
Eigenmaps (Donoho and Grimes 2003), Laplacian Eigenmaps (Belkin and Niyogi 2003), lo- 
cal tangent space alignment (Zhang and Zha 2004), Dimensionality Reduction by Learning 
an Invariant Mapping (Hadsell, Chopra, and LeCun 2006), Modified LLE (Zhang and Wang 
2007), t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton 
2008; van der Maaten 2014), and UMAP (McInnes, Healy, and Melville 2018). Many of 
these algorithms are reviewed in Lee and Verleysen (2007), Zheng and Xue (2009), and on 


¹⁴https://scikit-learn.org/stable/modules/manifold.html
Figure 5.20 Examples of manifold learning, i.e., non-linear dimensionality reduction, applied
to 1,000 points with 10 neighbors each, from https://scikit-learn.org/stable/modules/manifold.html.
The eight sample outputs were produced by eight different embedding algorithms,
as described in the scikit-learn manifold learning documentation page.


Wikipedia.¹⁵ Bengio, Paiement et al. (2004) describe a method for extending such algorithms
to compute the embedding of new (“out-of-sample”) data points. McQueen, Meila et
al. (2016) describe their megaman software package, which can efficiently solve embedding 
problems with millions of data points. 

In addition to dimensionality reduction, which can be useful for regularizing data and 
accelerating similarity search, manifold learning algorithms can be used for visualizing input
data distributions or neural network layer activations. Figure 5.21 shows an example of
applying two such algorithms (UMAP and t-SNE) to three different computer vision datasets. 


5.2.5 Semi-supervised learning 


In many machine learning settings, we have a modest amount of accurately labeled data and 
a far larger set of unlabeled or less accurate data. For example, an image classification dataset 
such as ImageNet may only contain one million labeled images, but the total number of 
images that can be found on the web is orders of magnitude larger. Can we use this larger
dataset, which still captures characteristics of our expected future inputs, to construct a better
classifier or predictor? 


¹⁵https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
Figure 5.21 Comparison of UMAP and t-SNE manifold learning algorithms © McInnes,
Healy, and Melville (2018) on three different computer vision learning recognition tasks: 
COIL (Nene, Nayar, and Murase 1996), MNIST (LeCun, Cortes, and Burges 1998), and 
Fashion MNIST (Xiao, Rasul, and Vollgraf 2017). 




Figure 5.22 Examples of semi-supervised learning (Zhu and Goldberg 2009) © 2009 Mor- 
gan & Claypool: (a) two labeled samples and a graph connecting all of the samples; (b) 
solving binary labeling with harmonic functions, interpreted as a resistive electrical network; 


(c) using semi-supervised support vector machine (S3VM). 
Consider the simple diagrams in Figure 5.22. Even if only a small number of examples
are labeled with the correct class (in this case, indicated by red and blue circles or dots), we 
can still imagine extending these labels (inductively) to nearby samples and therefore not only 
labeling all of the data, but also constructing appropriate decision surfaces for future inputs. 
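One simple way to extend labels to nearby samples, in the spirit of the harmonic-function labeling mentioned in the caption of Figure 5.22, is to repeatedly replace each unlabeled value with the average of its graph neighbors while keeping the labeled values fixed. The sketch below is a hypothetical toy example (the function name `harmonic_labels`, the chain-shaped graph, and the iteration count are assumptions made for illustration), not the algorithm of any cited paper.

```python
def harmonic_labels(neighbors, labeled, num_iters=500):
    """neighbors: dict node -> list of adjacent nodes; labeled: dict node -> 0.0 or 1.0."""
    # Unlabeled nodes start at 0.5; labeled nodes keep their given value throughout.
    values = {node: labeled.get(node, 0.5) for node in neighbors}
    for _ in range(num_iters):
        for node, adj in neighbors.items():
            if node not in labeled:
                # Harmonic condition: each unlabeled value is the mean of its neighbors.
                values[node] = sum(values[j] for j in adj) / len(adj)
    return values

# A chain of five samples whose two end points carry opposite labels.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
values = harmonic_labels(chain, labeled={0: 0.0, 4: 1.0})
predicted = {node: int(v > 0.5) for node, v in values.items()}
```

The soft values vary smoothly between the two labeled samples, and thresholding them assigns each unlabeled sample to the closer labeled class. The middle sample sits right at the threshold, so its hard label is not meaningful.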

This area of study is called semi-supervised learning (Zhu and Goldberg 2009; Subra- 
manya and Talukdar 2014). In general, it comes in two varieties. In transductive learning, 
the goal is to classify all of the unlabeled inputs that are given as one batch at the same time 
as the labeled examples, i.e., all of the dots and circles shown in Figure 5.22. In inductive 
learning, we train a machine learning system that will classify all future inputs, i.e., all the
regions in the input space. The second form is much more widely used, since in practice, 
most machine learning systems are used for online applications such as autonomous driving 
or new content classification. 

Semi-supervised learning is a subset of the larger class of weakly supervised learning 
problems, where the training data may not only be missing labels, but also have labels of 
questionable accuracy (Zhou 2018). Some early examples from computer vision (Torresani 
2014) include building whole image classifiers from image labels found on the internet (Fer- 
gus, Perona, and Zisserman 2004; Fergus, Weiss, and Torralba 2009) and object detection 
and/or segmentation (localization) with missing or very rough delineations in the training 
data (Nguyen, Torresani et al. 2009; Deselaers, Alexe, and Ferrari 2012). In the deep learn- 
ing era, weakly supervised learning continues to be widely used (Pathak, Krahenbuhl, and 
Darrell 2015; Bilen and Vedaldi 2016; Arandjelovic, Gronat et al. 2016; Khoreva, Benenson 
et al. 2017; Novotny, Larlus, and Vedaldi 2017; Zhai, Oliver et al. 2019). A recent example of 
weakly supervised learning being applied to billions of noisily labeled images is pre-training 
deep neural networks on Instagram images with hashtags (Mahajan, Girshick et al. 2018). We 
will look at weakly and self-supervised learning techniques for pre-training neural networks 
in Section 5.4.7. 


5.3 Deep neural networks 


As we saw in the introduction to this chapter (Figure 5.2), deep learning pipelines take an end- 
to-end approach to machine learning, optimizing every stage of the processing by searching 
for parameters that minimize the training loss. In order for such search to be feasible, it helps 
if the loss is a differentiable function of all these parameters. Deep neural networks provide a 
uniform, differentiable computation architecture, while also automatically discovering useful 
internal representations. 


Interest in building computing systems that mimic neural (biological) computation has 
waxed and waned since the late 1950s, when Rosenblatt (1958) developed the perceptron
and Widrow and Hoff (1960) derived the weight adaptation delta rule. Research into these 
topics was revitalized in the late 1970s by researchers who called themselves connectionists, 
organizing a series of meetings around this topic, which resulted in the foundation of the 
Neural Information Processing Systems (NeurIPS) conference in 1987. The recent book by 
Sejnowski (2018) has a nice historical review of this field’s development, as do the intro- 
ductions in Goodfellow, Bengio, and Courville (2016) and Zhang, Lipton et al. (2021), the 
review paper by Rawat and Wang (2017), and the Turing Award lecture by Bengio, LeCun, 
and Hinton (2021). And while most of the deep learning community has moved away from 
biologically plausible models, some research still studies the connection between biological 
visual systems and neural network models (Yamins and DiCarlo 2016; Zhuang, Yan et al. 
2020). 


A good collection of papers from this era can be found in McClelland, Rumelhart, and 
PDP Research Group (1987), including the seminal paper on backpropagation (Rumelhart, 
Hinton, and Williams 1986a), which laid the foundation for the training of modern feedfor- 
ward neural networks. During that time, and in the succeeding decades, a number of alter- 
native neural network architectures were developed, including ones that used stochastic units 
such as Boltzmann Machines (Ackley, Hinton, and Sejnowski 1985) and Restricted Boltz- 
mann Machines (Hinton and Salakhutdinov 2006; Salakhutdinov and Hinton 2009). The 
survey by Bengio (2009) has a review of some of these earlier approaches to deep learn- 
ing. Many of these architectures are examples of the generative graphical models we saw in 
Section 4.3. 


Today’s most popular deep neural networks are deterministic discriminative feedforward 
networks with real-valued activations, trained using gradient descent, i.e., the backpropagation
training rule (Rumelhart, Hinton, and Williams 1986b). When combined with ideas
from convolutional networks (Fukushima 1980; LeCun, Bottou et al. 1998), deep multi-layer 
neural networks produced the breakthroughs in speech recognition (Hinton, Deng et al. 2012) 
and visual recognition (Krizhevsky, Sutskever, and Hinton 2012; Simonyan and Zisserman 
2014b) seen in the early 2010s. Zhang, Lipton et al. (2021, Chapter 7) have a nice descrip- 
tion of the components that went into these breakthroughs and the rapid evolution in deep 
networks that has occurred since then, as does the earlier review paper by Rawat and Wang
(2017).


Compared to other machine learning techniques, which normally rely on several pre- 
processing stages to extract features on which classifiers can be built, deep learning ap- 
proaches are usually trained end-to-end, going directly from raw pixels to final desired out- 


puts (be they classifications or other images). In the next few sections, we describe the basic 
Figure 5.23 A perceptron unit (a) explicitly showing the weights being multiplied by the
inputs, (b) with the weights written on the input connections, and (c) the most common form, 
with the weights and bias omitted. A non-linear activation function follows the weighted 
summation. © Glassner (2018) 


components that go into constructing and training such neural networks. More detailed expla- 
nations on each topic can be found in textbooks on deep learning (Nielsen 2015; Goodfellow, 
Bengio, and Courville 2016; Glassner 2018, 2021; Zhang, Lipton et al. 2021) as well as the 
excellent course notes by Li, Johnson, and Yeung (2019) and Johnson (2020). 


5.3.1 Weights and layers 


Deep neural networks (DNNs) are feedforward computation graphs composed of thousands 
of simple interconnected “neurons” (units), which, much like logistic regression (5.18), per- 
form weighted sums of their inputs 


$$ s_i = w_i^T x_i + b_i \tag{5.45} $$

followed by a non-linear activation function re-mapping,

$$ y_i = h(s_i), \tag{5.46} $$

as illustrated in Figure 5.23. The $x_i$ are the inputs to the $i$th unit, $w_i$ and $b_i$ are its learnable
weights and bias, $s_i$ is the output of the weighted linear sum, and $y_i$ is the final output after $s_i$
is fed through the activation function $h$.¹⁶ The outputs of each stage, which are often called


¹⁶Note that we have switched to using $s_i$ for the weighted summations, since we will want to use $l$ to index neural
network layers.

Figure 5.24 A multi-layer network, showing how the outputs of one unit are fed into additional
units. © Glassner (2018)


the activations, are then fed into units in later stages, as shown in Figure 5.24.¹⁷
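As a minimal sketch of what a single unit computes, the weighted sum (5.45) and activation (5.46) can be written directly in Python. The function name `unit_output` and the numeric values are made up for illustration, and a ReLU is assumed for the activation function h.

```python
def relu(s):
    return max(0.0, s)

def unit_output(w, x, b, h=relu):
    """One unit: weighted sum of the inputs plus bias, then a non-linearity."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b   # weighted sum, as in (5.45)
    return h(s)                                    # activation, as in (5.46)

w, b = [0.5, -1.0, 2.0], 0.5
y_active = unit_output(w, [1.0, 2.0, 3.0], b)    # weighted sum is positive
y_clipped = unit_output(w, [1.0, 2.0, -3.0], b)  # weighted sum is negative
```

With the assumed ReLU, the second input produces a negative weighted sum and the unit outputs zero.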

The earliest such units were called perceptrons (Rosenblatt 1958) and were diagramed 
as shown in Figure 5.23a. Note that in this first diagram, the weights, which are optimized 
during the learning phase (Section 5.3.5), are shown explicitly along with the element-wise 
multiplications. Figure 5.23b shows a form in which the weights are written on top of the con- 
nections (arrows between units, although the arrowheads are often omitted). It is even more 
common to diagram nets as in Figure 5.23c, in which the weights (and bias) are completely 
omitted and assumed to be present. 

Instead of being connected into an irregular computation graph as in Figure 5.24, neural 
networks are usually organized into consecutive layers, as shown in Figure 5.25. We can 
now think of all the units within a layer as being a vector, with the corresponding linear 
combinations written as 

$$ s_l = W_l x_l, \tag{5.47} $$

where $x_l$ are the inputs to layer $l$, $W_l$ is a weight matrix, and $s_l$ is the weighted sum, to which
an element-wise non-linearity is applied using a set of activation functions,

$$ x_{l+1} = y_l = h(s_l). \tag{5.48} $$
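The layer-wise form in (5.47) and (5.48) can be sketched as a matrix-vector product followed by an element-wise non-linearity. The helper names `fc_layer` and `mlp_forward` and the weight values below are hypothetical, the activation is assumed to be a ReLU, and, as in (5.47), bias terms are left out.

```python
def relu(s):
    return max(0.0, s)

def fc_layer(W, x, h=relu):
    """One layer: s_l = W_l x_l, then x_{l+1} = h(s_l) applied element-wise."""
    s = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    return [h(si) for si in s]

def mlp_forward(weights, x):
    """Feed x through consecutive layers, one weight matrix per layer."""
    for W in weights:
        x = fc_layer(W, x)
    return x

W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]   # 3 units, each with 2 inputs
W2 = [[1.0, 1.0, 1.0]]                        # 1 unit with 3 inputs
y = mlp_forward([W1, W2], [2.0, 1.0])
```

Each row of a weight matrix holds the weights of one unit, so the number of rows sets the number of outputs of that layer.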


A layer in which a full (dense) weight matrix is used for the linear combination is called 


a fully connected (FC) layer, since all of the inputs to one layer are connected to all of its 


¹⁷Note that while almost all feedforward neural networks use linear weighted summations of their inputs, the
Neocognitron (Fukushima 1980) also included a divisive normalization stage inspired by the behavior of biological 
neurons. Some of the latest DNNs also support multiplicative interactions between activations using conditional 
batch norm (Section 5.3.3). 
Figure 5.25 Two different ways to draw neural networks: (a) inputs at bottom, outputs at
top, (b) inputs at left, outputs at right. © Glassner (2018) 


outputs. As we will see in Section 5.4, when processing pixels (or other signals), early stages 
of processing use convolutions instead of dense connections for both spatial invariance and 
better efficiency.¹⁸ A network that consists only of fully connected (and no convolutional)


layers is now often called a multi-layer perceptron (MLP). 


5.3.2 Activation functions


Most early neural networks (Rumelhart, Hinton, and Williams 1986b; LeCun, Bottou et al. 
1998) used sigmoidal functions similar to the ones used in logistic regression. Newer net- 
works, starting with Nair and Hinton (2010) and Krizhevsky, Sutskever, and Hinton (2012), 
use Rectified Linear Units (ReLU) or variants. The ReLU activation function is defined as 


h(y) = max(0, y) (5.49) 


and is shown in the upper-left corner of Figure 5.26, along with some other popular functions, 
whose definitions can be found in a variety of publications (e.g., Goodfellow, Bengio, and 
Courville 2016, Section 6.3; Clevert, Unterthiner, and Hochreiter 2015; He, Zhang et al. 
2015) and the Machine Learning Cheatsheet.¹⁹
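For reference, several of these activation functions can be written in a few lines of Python. These are the commonly used textbook forms rather than definitions taken from this chapter; the default slope and scale values (`alpha`) are assumptions, and the naive `softplus` and `sigmoid` below will overflow for inputs of very large magnitude.

```python
import math

def relu(s):
    return max(0.0, s)                       # (5.49)

def leaky_relu(s, alpha=0.01):
    return s if s > 0 else alpha * s         # small slope for negative inputs

def elu(s, alpha=1.0):
    return s if s > 0 else alpha * (math.exp(s) - 1.0)

def softplus(s):
    return math.log1p(math.exp(s))           # smooth approximation to ReLU

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def swish(s):
    return s * sigmoid(s)

# tanh is available directly as math.tanh.
samples = {name: f(-2.0) for name, f in
           [("relu", relu), ("leaky_relu", leaky_relu), ("elu", elu),
            ("softplus", softplus), ("sigmoid", sigmoid), ("swish", swish),
            ("tanh", math.tanh)]}
```

Only the ReLU returns exactly zero for the negative input; the other functions keep a small non-zero response there.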

While the ReLU is currently the most popular activation function, a widely cited observation
in the CS231n course notes (Li, Johnson, and Yeung 2019) attributed to Andrej Karpathy


¹⁸Heads up for more confusing abbreviations: While a fully connected (dense) layer is often abbreviated as FC, a
fully convolutional network, which is the opposite, i.e., sparsely connected with shared weights, is often abbreviated
as FCN.

¹⁹https://ml-cheatsheet.readthedocs.io/en/latest/activation-functions.html


Figure 5.26 Some popular non-linear activation functions from © Glassner (2018): From 
top-left to bottom-right: ReLU, leaky ReLU, shifted ReLU, maxout, softplus, ELU, sigmoid, 
tanh, swish. 


warns that²⁰


Unfortunately, ReLU units can be fragile during training and can “die”. 
For example, a large gradient flowing through a ReLU neuron could cause the 
weights to update in such a way that the neuron will never activate on any data- 
point again. If this happens, then the gradient flowing through the unit will 
forever be zero from that point on. That is, the ReLU units can irreversibly die 
during training since they can get knocked off the data manifold. ... With a proper 
setting of the learning rate this is less frequently an issue. 


The CS231n course notes advocate trying some alternative non-clipping activation functions 


²⁰http://cs231n.github.io/neural-networks-1/#actfun
Figure 5.27 (a) A softmax layer used to convert from neural network activations (“score”)
to class likelihoods. (b) The top row shows the activations, while the bottom shows the result
of running the scores through softmax to obtain properly normalized likelihoods. © Glassner
(2018).


if this problem arises. 

For the final layer in networks used for classification, the softmax function (5.3) is nor- 
mally used to convert from real-valued activations to class likelihoods, as shown in Fig- 
ure 5.27. We can thus think of the penultimate set of neurons as determining directions in 
activation space that most closely match the log likelihoods of their corresponding class, 
while minimizing the log likelihoods of alternative classes. Since the inputs flow forward to 
the final output classes and probabilities, feedforward networks are discriminative, i.e., they 
have no statistical model of the classes they are outputting, nor any straightforward way to 


generate samples from such classes (but see Section 5.5.4 for techniques to do this). 
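A softmax layer can be sketched as follows. This is a generic, numerically stabilized version (subtracting the maximum score before exponentiating), written for illustration; the function name and the example scores are not from the text.

```python
import math

def softmax(scores):
    """Convert real-valued activations (scores) into normalized class likelihoods."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Because only differences between scores matter, adding a constant to every activation leaves the resulting likelihoods unchanged, which is why the shift by the maximum is safe.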


5.3.3 Regularization and normalization 


As with other forms of machine learning, regularization and other techniques can be used to 
prevent neural networks from overfitting so they can better generalize to unseen data. In this 
section, we discuss traditional methods such as regularization and data augmentation that can 
be applied to most machine learning systems, as well as techniques such as dropout and batch 


normalization, which are specific to neural networks. 


Regularization and weight decay 


As we saw in Section 4.1.1, quadratic or p-norm penalties on the weights (4.9) can be used 
to improve the conditioning of the system and to reduce overfitting. Setting p = 2 results in 


the usual $L_2$ regularization and makes large weights smaller, whereas using p = 1 is called
Figure 5.28 An original “6” digit from the MNIST database and two elastically distorted
versions (Simard, Steinkraus, and Platt 2003) © 2003 IEEE.


lasso (least absolute shrinkage and selection operator) and can drive some weights all the way 
to zero. As the weights are being optimized inside a neural network, these terms make the 
weights smaller, so this kind of regularization is also known as weight decay (Bishop 2006, 
Section 3.1.4; Goodfellow, Bengio, and Courville 2016, Section 7.1; Zhang, Lipton et al. 
2021, Section 4.5).²¹ Note that for more complex optimization algorithms such as Adam,
$L_2$ regularization and weight decay are not equivalent, but the desirable properties of weight


decay can be restored using a modified algorithm (Loshchilov and Hutter 2019). 
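For plain gradient descent, the link between an $L_2$ penalty and weight decay can be checked numerically. The sketch below assumes the penalty is written as (lam / 2) times the squared norm of the weights and that a single fixed learning rate `lr` is used; both function names are made up for the example. It does not cover adaptive optimizers such as Adam, where, as noted above, the two are no longer equivalent.

```python
def sgd_step_l2(w, grad, lr, lam):
    """Gradient step on loss + (lam / 2) * ||w||^2, whose gradient is grad + lam * w."""
    return [wi - lr * (gi + lam * wi) for wi, gi in zip(w, grad)]

def sgd_step_weight_decay(w, grad, lr, lam):
    """Shrink (decay) the weights by (1 - lr * lam), then take the plain gradient step."""
    return [(1.0 - lr * lam) * wi - lr * gi for wi, gi in zip(w, grad)]

w = [1.0, -2.0, 0.5]
grad = [0.3, -0.1, 0.0]        # gradient of the data term only
w_l2 = sgd_step_l2(w, grad, lr=0.1, lam=0.01)
w_wd = sgd_step_weight_decay(w, grad, lr=0.1, lam=0.01)
```

The third weight has a zero data gradient, so its update shows the decay in isolation: it simply shrinks toward zero by the factor (1 - lr * lam).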


Dataset augmentation 


Another powerful technique to reduce over-fitting is to add more training samples by perturb- 
ing the inputs and/or outputs of the samples that have already been collected. This technique 
is known as dataset augmentation (Zhang, Lipton et al. 2021, Section 13.1) and can be partic- 
ularly effective on image classification tasks, since it is expensive to obtain labeled examples, 


and also since image classes should not change under small local perturbations. 


An early example of such work applied to a neural network classification task is the elastic 
distortion technique proposed by Simard, Steinkraus, and Platt (2003). In their approach, ran- 
dom low-frequency displacement (warp) fields are synthetically generated for each training 
example and applied to the inputs during training (Figure 5.28). Note how such distortions 
are not the same as simply adding pixel noise to the inputs. Instead, distortions move pixels 
around, and therefore introduce much larger changes in the input vector space, while still pre- 
serving the semantic meaning of the examples (in this case, MNIST digits (LeCun, Cortes, 
and Burges 1998)). 
(a) Standard Neural Net (b) After applying dropout.


Figure 5.29 When using dropout, during training some fraction of units p is removed from 
the network (or, equivalently, clamped to zero) © Srivastava, Hinton et al. (2014). Doing 
this randomly for each mini-batch injects noise into the training process (at all levels of the 


network) and prevents the network from overly relying on particular units. 


Dropout 


Dropout is a regularization technique introduced by Srivastava, Hinton et al. (2014), where 
at each mini-batch during training (Section 5.3.6), some percentage p (say 50%) of the units 
in each layer are clamped to zero, as shown in Figure 5.29. Randomly setting units to zero 
injects noise into the training process and also prevents the network from overly specializing 
units to particular samples or tasks, both of which can help reduce overfitting and improve 
generalization. 

Because dropping (zeroing out) a fraction p of the units scales the expected value of any sum
the units contribute to by a factor of (1 − p), the weighted sums $s_i$ in each layer (5.45) are
multiplied (during training) by $(1 - p)^{-1}$. At test time, the network is run with no dropout
and no compensation on the sums. A more detailed description of dropout can be found in 
Zhang, Lipton et al. (2021, Section 4.6) and Johnson (2020, Lecture 10). 
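The training-time compensation described above (often called inverted dropout) can be sketched as follows. For simplicity, this hypothetical `dropout` helper zeroes and rescales a list of unit outputs directly rather than the downstream weighted sums; the fixed random seed is only there to make the example repeatable.

```python
import random

def dropout(values, p, training, rng=random):
    """Zero each entry with probability p and rescale the survivors by 1 / (1 - p)."""
    if not training:
        return list(values)        # test time: no dropout and no compensation
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)
values = [1.0] * 10000
dropped = dropout(values, p=0.5, training=True, rng=rng)
mean_train = sum(dropped) / len(dropped)   # stays close to the original mean of 1.0
```

Rescaling during training keeps the expected value of each unit unchanged, which is why nothing needs to be adjusted when the full network is run at test time.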


Batch normalization 


Optimizing the weights in a deep neural network, which we discuss in more detail in Sec- 
tion 5.3.6, is a tricky process and may be slow to converge. 
One of the classic problems with iterative optimization techniques is poor conditioning, 


where the components of the gradient vary greatly in magnitude. While it is sometimes 


21 From a Bayesian perspective, we can also think of this penalty as a Gaussian prior on the weight distribution 
(Appendix B.4). 
possible to reduce these effects with preconditioning techniques that scale individual elements
in a gradient before taking a step (Section 5.3.6 and Appendix A.5.2), it is usually preferable 
to control the condition number of the system during the problem formulation. 

In deep networks, one way in which poor conditioning can manifest itself is if the sizes 
of the weights or activations in successive layers become imbalanced. Say we take a given 
network and scale all of the weights in one layer by 100x and scale down the weights in the 
next layer by the same amount. Because the ReLU activation function is linear in both of 
its domains, the outputs of the second layer will still be the same, although the activations at 
the output of the first layer will be 100 times larger. During the gradient descent step, the
derivatives with respect to the weights will be vastly different after this rescaling, and will in
fact be inversely related in magnitude to the weights themselves, requiring tiny gradient descent steps
to prevent overshooting (see Exercise 5.4).²²
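The rescaling argument can be verified on the smallest possible example, a two-layer network with one scalar weight per layer. The function name and the numbers below are made up for illustration, and the derivatives are taken with respect to the network output rather than a particular loss.

```python
def forward_and_grads(w1, w2, x):
    """Scalar two-layer net y = w2 * relu(w1 * x); returns (y, dy/dw1, dy/dw2)."""
    a = max(0.0, w1 * x)                  # first-layer activation
    y = w2 * a
    dy_dw1 = w2 * x if a > 0 else 0.0     # chain rule through the ReLU
    dy_dw2 = a
    return y, dy_dw1, dy_dw2

y, g1, g2 = forward_and_grads(w1=0.5, w2=2.0, x=3.0)
# Scale the first layer up by 100 and the second layer down by 100.
y_s, g1_s, g2_s = forward_and_grads(w1=0.5 * 100, w2=2.0 / 100, x=3.0)
```

The output is unchanged, but the derivative for the enlarged first-layer weight shrinks by a factor of 100 while the derivative for the reduced second-layer weight grows by the same factor, so a single step size can no longer suit both layers.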

The idea behind batch normalization (Ioffe and Szegedy 2015) is to re-scale (and re-center)
the activations at a given unit so that they have unit variance and zero mean (which,
for a ReLU activation function, means that the unit will be active half the time). We perform 
this normalization by considering all of the training samples n in a given minibatch B (5.71) 


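and computing the mean and variance statistics for unit $i$ as

$$ \mu_i = \frac{1}{|B|} \sum_{n \in B} s_i^{(n)} \tag{5.50} $$

$$ \sigma_i^2 = \frac{1}{|B|} \sum_{n \in B} \left( s_i^{(n)} - \mu_i \right)^2 \tag{5.51} $$

$$ \hat{s}_i^{(n)} = \frac{s_i^{(n)} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \tag{5.52} $$

where $s_i^{(n)}$ is the weighted sum of unit $i$ for training sample $n$, $\hat{s}_i^{(n)}$ is the corresponding batch
normalized sum, and $\epsilon$ (often $10^{-5}$) is a small constant to prevent division by zero.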

After batch normalization, the $\hat{s}_i^{(n)}$ activations now have zero mean and unit variance.
However, this normalization may run at cross-purposes to the minimization of the loss function
during training. For this reason, Ioffe and Szegedy (2015) add an extra gain $\gamma_i$ and bias $\beta_i$
parameter to each unit $i$ and define the output of a batch normalization stage to be

$$ y_i = \gamma_i \hat{s}_i + \beta_i. \tag{5.53} $$
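The training-time computation in (5.50-5.53) for a single unit can be sketched as below. The function name `batch_norm` and the example minibatch are hypothetical; only the forward pass is shown, not the derivatives needed for backpropagation.

```python
import math

def batch_norm(s, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the weighted sums s of one unit over a minibatch, then apply gain and bias."""
    n = len(s)
    mu = sum(s) / n                                          # mean, as in (5.50)
    var = sum((si - mu) ** 2 for si in s) / n                # variance, as in (5.51)
    s_hat = [(si - mu) / math.sqrt(var + eps) for si in s]   # normalized sums, as in (5.52)
    return [gamma * sh + beta for sh in s_hat]               # gain and bias, as in (5.53)

y = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
y_mean = sum(y) / len(y)
y_var = sum((yi - y_mean) ** 2 for yi in y) / len(y)
```

After the gain and bias are applied, the outputs have mean equal to the bias and a standard deviation close to the gain, up to the small effect of the constant added inside the square root.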


²²This motivating paragraph is my own explanation of why batch normalization might be a good idea, and is
related to the idea that batch normalization reduces internal covariate shift, used by Ioffe and Szegedy (2015) to
justify their technique. This hypothesis is now being questioned and alternative theories are being developed (Bjorck,
Gomes et al. 2018; Santurkar, Tsipras et al. 2018; Kohler, Daneshmand et al. 2019).




These parameters act just like regular weights, i.e., they are modified using gradient descent
during training to reduce the overall training loss.²³

One subtlety with batch normalization is that the $\mu_i$ and $\sigma_i^2$ quantities depend analytically
on all of the activations for a given unit in a minibatch. For gradient descent to be properly
defined, the derivatives of the loss function with respect to these variables, and the derivatives
of the quantities $\hat{s}_i$ and $y_i$ with respect to these variables, must be computed as part of the
gradient computation step, using similar chain rule computations as the original backpropagation
algorithm (5.65-5.68). These derivations can be found in Ioffe and Szegedy (2015) as
well as several blogs.²⁴

When batch normalization is applied to convolutional layers (Section 5.4), one could in
principle compute a normalization separately for each pixel, but this would add a tremendous
number of extra learnable bias and gain parameters ($\beta_i$, $\gamma_i$). Instead, batch normalization is
usually implemented by computing the statistics as sums over all the pixels with the same
convolution kernel, and then adding a single bias and gain parameter for each convolution
kernel (Ioffe and Szegedy 2015; Johnson 2020, Lecture 10; Zhang, Lipton et al. 2021, Section 7.5).

Having described how batch normalization operates during training, we still need to de- 
cide what to do at test or inference time, i.e., when applying the trained network to new 
data. We cannot simply skip this stage, as the network was trained while removing common 
mean and variance estimates. For this reason, the mean and variance estimates are usually 
recomputed over the whole training set, or some running average of the per-batch statistics
is used. Because of the linear form of (5.45) and (5.52-5.53), it is possible to fold the $\mu_i$ and
$\sigma_i$ estimates and learned ($\beta_i$, $\gamma_i$) parameters into the original weight and bias terms in (5.45).
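The folding step can be sketched for a single unit as follows. The function name `fold_batch_norm` and all numeric values are made up for illustration; the mean and variance are assumed to be fixed estimates (e.g., running averages) rather than per-batch statistics.

```python
import math

def fold_batch_norm(w, b, mu, var, gamma, beta, eps=1e-5):
    """Return (w', b') such that w'.x + b' = gamma * (w.x + b - mu) / sqrt(var + eps) + beta."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * wi for wi in w], scale * (b - mu) + beta

w, b = [0.5, -1.0], 0.25
mu, var, gamma, beta = 0.3, 2.0, 1.5, -0.2
x = [2.0, 1.0]

# Reference: weighted sum followed by an explicit normalization stage.
s = sum(wi * xi for wi, xi in zip(w, x)) + b
reference = gamma * (s - mu) / math.sqrt(var + 1e-5) + beta

# Folded: a single weighted sum with modified weights and bias.
w_f, b_f = fold_batch_norm(w, b, mu, var, gamma, beta)
folded = sum(wi * xi for wi, xi in zip(w_f, x)) + b_f
```

Since the two computations agree, the normalization stage adds no cost at inference time once its parameters have been absorbed into the preceding weights and bias.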

Since the publication of the seminal paper by Ioffe and Szegedy (2015), a number of 
variants have been developed, some of which are illustrated in Figure 5.30. Instead of accu- 
mulating statistics over the samples in a minibatch B, we can compute them over different 


subsets of activations in a layer. These subsets include: 


• all the activations in a layer, which is called layer normalization (Ba, Kiros, and Hinton
2016);

• all the activations in a given convolutional output channel (see Section 5.4), which is
called instance normalization (Ulyanov, Vedaldi, and Lempitsky 2017);


²³There is a trick used by those in the know, which relies on the observation that any bias term $b_i$ in the original
summation $s_i$ (5.45) shows up in the mean $\mu_i$ and gets subtracted out. For this reason, the bias term is often omitted
when using batch (or other kinds of) normalization.

²⁴https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html,
https://kevinzakka.github.io/2016/09/14/batch_normalization, https://deepnotes.io/batchnorm
Figure 5.30 Batch norm, layer norm, instance norm, and group norm, from Wu and He
(2018) © 2018 Springer. The (H, W) dimension denotes pixels, C denotes channels, and N
denotes training samples in a minibatch. The pixels in blue are normalized by the same mean
and variance.


• different sub-groups of output channels, which is called group normalization (Wu and
He 2018). 
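One way to keep these variants straight is to ask which activations share a single (mean, variance) pair. The sketch below follows the conventions in the caption of Figure 5.30 (N samples, C channels, with all pixels of a channel always pooled); the function names and the choice of two channel groups are assumptions made for illustration, and the number of channels is assumed to be divisible by the number of groups.

```python
def stats_group(kind, n, c, num_channels, num_groups=2):
    """Key of the (mean, variance) pair used for sample n, channel c; all pixels share it."""
    if kind == "batch":
        return ("channel", c)                  # pooled over samples and pixels
    if kind == "layer":
        return ("sample", n)                   # pooled over channels and pixels
    if kind == "instance":
        return ("sample", n, "channel", c)     # pooled over pixels only
    if kind == "group":
        per_group = num_channels // num_groups
        return ("sample", n, "group", c // per_group)
    raise ValueError(kind)

def count_groups(kind, N, C, num_groups=2):
    """Number of distinct (mean, variance) pairs for N samples and C channels."""
    return len({stats_group(kind, n, c, C, num_groups)
                for n in range(N) for c in range(C)})

counts = {kind: count_groups(kind, N=4, C=6)
          for kind in ("batch", "layer", "instance", "group")}
```

Only batch normalization pools statistics across different samples in the minibatch; the other three variants compute their statistics independently for each sample.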


The paper by Wu and He (2018) describes each of these in more detail and also compares 
them experimentally. More recent work by Qiao, Wang et al. (2019a) and Qiao, Wang et al. 
(2019b) discusses some of the disadvantages of these newer variants and proposes two new 
techniques called weight standardization and batch channel normalization to mitigate these 
problems. 

Instead of modifying the activations in a layer using their statistics, it is also possible to 
modify the weights in a layer to explicitly make the weight norm and weight vector direction 
separate parameters, which is called weight normalization (Salimans and Kingma 2016). A 
related technique called spectral normalization (Miyato, Kataoka et al. 2018) constrains the 
largest singular value of the weight matrix in each layer to be 1. 

The bias and gain parameters ($\beta_i$, $\gamma_i$) may also depend on the activations in some other
layer in the network, e.g., derived from a guide image. Such techniques are referred to 
as conditional batch normalization and have been used to select between different artistic 
styles (Dumoulin, Shlens, and Kudlur 2017) and to enable local semantic guidance in image 
synthesis (Park, Liu et al. 2019). Related techniques and applications are discussed in more 
detail in Section 14.6 on neural rendering. 

The reasons why batch and other kinds of normalization help deep networks converge 
faster and generalize better are still being debated. Some recent papers on this topic include 
Bjorck, Gomes et al. (2018), Hoffer, Banner et al. (2018), Santurkar, Tsipras et al. (2018), 
and Kohler, Daneshmand et al. (2019). 


²⁵Note that this gives neural networks the ability to multiply two layers in a network, which we used previously
to perform locally (Section 3.5.5).
5.3.4 Loss functions


In order to optimize the weights in a neural network, we need to first define a loss function 
that we minimize over the training examples. We have already seen the main loss functions 
used in machine learning in previous parts of this chapter. 

For classification, most neural networks use a final softmax layer (5.3), as shown in Fig- 
ure 5.27. Since the outputs are meant to be class probabilities that sum up to 1, it is natural to 
use the cross-entropy loss given in (5.19) or (5.23-5.24) as the function to minimize during 
training. Since in our description of the feedforward networks we have used indices 7 and 7 
to denote neural units, we will, in this section, use n to index a particular training example. 


The multi-class cross-entropy loss can thus be re-written as

$$E(\mathbf{w}) = \sum_n E_n(\mathbf{w}) = -\sum_n \log p_{n t_n}, \tag{5.54}$$

where $\mathbf{w}$ is the vector of all weights, biases, and other model parameters, $p_{nk}$ is the
network’s current estimate of the probability of class $k$ for sample $n$, and $t_n$ is the integer
denoting the correct class. Substituting the definition of $p_{nk}$ from (5.20) with the appropriate
replacement of $l_{ik}$ with $s_{nk}$ (the notation we use for neural nets), we get

$$E_n(\mathbf{w}) = \log Z_n - s_{n t_n}, \tag{5.55}$$

with $Z_n = \sum_j \exp s_{nj}$. Gómez (2018) has a nice discussion of some of the losses widely
used in deep learning.
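As a minimal sketch of (5.54) and (5.55), the following NumPy snippet evaluates the per-sample loss $E_n = \log Z_n - s_{n t_n}$ directly from the scores. The function name, the example scores, and the max-subtraction trick (a standard way to avoid overflow that does not change the result) are illustrative assumptions, not part of the original text.

```python
import numpy as np

def cross_entropy_per_sample(scores, targets):
    """Return E_n = log Z_n - s_{n, t_n} for each row n of `scores` (N x K)."""
    # Subtracting the row maximum leaves E_n unchanged but avoids overflow in exp().
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_Z = np.log(np.exp(shifted).sum(axis=1))
    return log_Z - shifted[np.arange(len(targets)), targets]

# Hypothetical scores s_{nk} for N = 2 samples and K = 3 classes.
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([0, 1])          # integer class labels t_n

E_n = cross_entropy_per_sample(scores, targets)
E = E_n.sum()                       # total loss, as in (5.54)
```

Working with the scores rather than with the probabilities avoids taking the logarithm of a very small number when the network is confidently wrong.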

For networks that perform regression, i.e., generate one or more continuous variables such
as depth maps or denoised images, it is common to use an $L_2$ loss,

$$E(\mathbf{w}) = \sum_n E_n(\mathbf{w}) = \sum_n \|\mathbf{y}_n - \mathbf{t}_n\|^2, \tag{5.56}$$

where $\mathbf{y}_n$ is the network output for sample $n$ and $\mathbf{t}_n$ is the corresponding training (target)
value, since this is a natural measure of error between continuous variables. However, if we
believe there may be outliers in the training data, or if gross errors are not so harmful as to
merit a quadratic penalty, more robust norms such as $L_1$ can be used (Barron 2019; Ranftl,
Lasinger et al. 2020). (It is also possible to use robust norms for classification, e.g., adding
an outlier probability to the class labels.)

As it is common to interpret the final outputs of a network as a probability distribution,
we need to ask whether it is wise to use such probabilities as a measure of confidence in a
particular answer. If a network is properly trained and predicting answers with good accuracy, it
is tempting to make this assumption. The training losses we have presented so far, however,
only encourage the network to maximize the probability-weighted correct answers, and do
not, in fact, encourage the network outputs to be properly confidence calibrated. Guo, Pleiss
et al. (2017) discuss this issue, and present some simple measures, such as multiplying the
log-likelihoods by a temperature (Platt 2000a), to improve the match between classifier
probabilities and true reliability. The GrokNet image recognition system (Bell, Liu et al. 2020),
which we discuss in Section 6.2.3, uses calibration to obtain better attribute probability
estimates.
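To make the idea of temperature-based calibration concrete, here is a small sketch in which the scores are divided by a scalar temperature `T` before the softmax. Dividing the scores is one common convention and is an assumption here, as are the function names and the example values; in practice `T` would be fit on held-out validation data.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def calibrated_probs(scores, T):
    """Softmax of temperature-scaled scores; T > 1 softens the distribution."""
    return softmax(scores / T)

scores = np.array([[4.0, 1.0, 0.5]])        # hypothetical network scores
p_raw = calibrated_probs(scores, T=1.0)      # uncalibrated probabilities
p_soft = calibrated_probs(scores, T=2.0)     # less confident, same ranking
```

Because the scaling is monotonic, the predicted class does not change; only the confidence attached to it does.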

For networks that hallucinate new images, e.g., when introducing missing high-frequency 
details (Section 10.3) or doing image transfer tasks (Section 14.6), we may want to use a 
perceptual loss (Johnson, Alahi, and Fei-Fei 2016; Dosovitskiy and Brox 2016; Zhang, Isola 
et al. 2018), which uses intermediate layer neural network responses as the basis of compar- 
ison between target and output images. It is also possible to train a separate discriminator 
network to evaluate the quality (and plausibility) of synthesized images, as discussed in
Section 5.5.4. More details on the application of loss functions to image synthesis can be found 
in Section 14.6 on neural rendering. 

While loss functions are traditionally applied to supervised learning tasks, where the correct
label or target value $t_n$ is given for each input, it is also possible to use loss functions in an
unsupervised setting. An early example of this was the contrastive loss function proposed by
Hadsell, Chopra, and LeCun (2006) to cluster samples that are similar together while spreading
dissimilar samples further apart. More formally, we are given a set of inputs $\{\mathbf{x}_i\}$ and
pairwise indicator variables $\{t_{ij}\}$ that indicate whether two inputs are similar.$^{26}$ The goal is
now to compute an embedding $\mathbf{v}_i$ for each input $\mathbf{x}_i$ such that similar input pairs have similar
embeddings (low distances), while dissimilar inputs have large embedding distances. Finding
mappings or embeddings that create useful distances between samples is known as (distance)
metric learning (Köstinger, Hirzer et al. 2012; Kulis 2013) and is a commonly used tool in
machine learning. The losses used to encourage the creation of such meaningful distances are
collectively known as ranking losses (Gómez 2019) and can be used to relate features from
different domains such as text and images (Karpathy, Joulin, and Fei-Fei 2014).

The contrastive loss from Hadsell, Chopra, and LeCun (2006) is defined as

$$E_{CL} = \sum_{(i,j) \in \mathcal{P}} \left\{ t_{ij} \log L_S(d_{ij}) + (1 - t_{ij}) \log L_D(d_{ij}) \right\}, \tag{5.57}$$

where $\mathcal{P}$ is the set of all labeled input pairs, $L_S$ and $L_D$ are the similar and dissimilar loss
functions, and $d_{ij} = \|\mathbf{v}_i - \mathbf{v}_j\|$ are the pairwise distances between paired embeddings.$^{27}$

26 Indicator variables are often denoted as $y_{ij}$, but we will stick to the $t_{ij}$ notation to be
consistent with Section 5.1.3.

27 In metric learning, the embeddings are very often normalized to unit length.

This has a form similar to the cross-entropy loss given in (5.19), except that we measure
squared distances between encodings $\mathbf{v}_i$ and $\mathbf{v}_j$. In their paper, Hadsell, Chopra, and LeCun
(2006) suggest using a quadratic function for $L_S$ and a quadratic hinge loss (cf. (5.30))
$L_D = [m - d_{ij}]_+^2$ for dissimilarity, where $m$ is a margin beyond which there is no penalty.
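The quadratic and quadratic-hinge penalties just described can be sketched as follows. This version sums the two penalties directly over the labeled pairs (without the logarithms that appear in (5.57)), which is an assumption made here for simplicity; the function name, the margin value, and the example embeddings are likewise hypothetical.

```python
import numpy as np

def contrastive_loss(v_i, v_j, t_ij, m=1.0):
    """Sum of quadratic (similar) and quadratic-hinge (dissimilar) pair penalties."""
    d = np.linalg.norm(v_i - v_j, axis=1)        # pairwise distances d_ij
    L_S = d ** 2                                 # pulls similar pairs together
    L_D = np.maximum(m - d, 0.0) ** 2            # pushes dissimilar pairs out to the margin m
    return np.sum(t_ij * L_S + (1.0 - t_ij) * L_D)

# Three hypothetical pairs of 2D embeddings.
v_i = np.zeros((3, 2))
v_j = np.array([[0.3, 0.4],    # similar pair at distance 0.5
                [0.3, 0.4],    # dissimilar pair at distance 0.5 (inside the margin)
                [3.0, 4.0]])   # dissimilar pair at distance 5 (beyond the margin)
t_ij = np.array([1.0, 0.0, 0.0])

loss = contrastive_loss(v_i, v_j, t_ij, m=1.0)
```

Only the first two pairs contribute: the similar pair is penalized for being apart, the close dissimilar pair is penalized for being inside the margin, and the distant dissimilar pair incurs no penalty.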

To train with a contrastive loss, you can run both inputs of each pair through the neural
network, compute the loss, and then backpropagate the gradients through both instantiations 
(activations) of the network. This can also be thought of as constructing a Siamese network 
consisting of two copies with shared weights (Bromley, Guyon et al. 1994; Chopra, Hadsell, 
and LeCun 2005). It is also possible to construct a triplet loss that takes as input a pair of 
matching samples and a third non-matching sample and ensures that the distance between 
non-matching samples is greater than the distance between matches plus some margin (Wein- 
berger and Saul 2009; Weston, Bengio, and Usunier 2011; Schroff, Kalenichenko, and Philbin 
2015; Rawat and Wang 2017). 
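A triplet loss of this kind can be sketched as a hinge on the difference between the matching and non-matching distances. Using plain (rather than squared) Euclidean distances, the margin value, and the function name are all assumptions made for illustration; individual papers differ in these details.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge penalty that is zero once d(anchor, negative) >= d(anchor, positive) + margin."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).sum()

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.3, 0.4]])                 # distance 0.5 from the anchor

# A far-away non-match already satisfies the margin, so it costs nothing.
loss_easy = triplet_loss(anchor, positive, np.array([[3.0, 4.0]]))
# A non-match at distance 0.6 violates the margin by 0.1.
loss_hard = triplet_loss(anchor, positive, np.array([[0.6, 0.0]]))
```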

Both pairwise contrastive and triplet losses can be used to learn embeddings for visual 
similarity search (Bell and Bala 2015; Wu, Manmatha et al. 2017; Bell, Liu et al. 2020), as 
discussed in more detail in Section 6.2.3. They have also been recently used for unsupervised 
pre-training of neural networks (Wu, Xiong et al. 2018; He, Fan et al. 2020; Chen, Kornblith 
et al. 2020), which we discuss in Section 5.4.7. In this case, it is more common to use a 
different contrastive loss function, inspired by softmax (5.3) and multi-class cross-entropy 
(5.20-5.22), which was first proposed by Sohn (2016). Before computing the loss, the
embeddings are all normalized to unit norm, $\|\tilde{\mathbf{v}}_i\|^2 = 1$. Then, the following loss is summed
over all matching embeddings,

$$l_{ij} = -\log \frac{\exp(\tilde{\mathbf{v}}_i \cdot \tilde{\mathbf{v}}_j / \tau)}{\sum_k \exp(\tilde{\mathbf{v}}_i \cdot \tilde{\mathbf{v}}_k / \tau)}, \tag{5.58}$$

with the denominator summed over non-matches as well. The $\tau$ variable denotes the
“temperature” and controls how tight the clusters will be; it is sometimes replaced with an $s$ multiplier
parameterizing the hyper-sphere radius (Deng, Guo et al. 2019). The exact details of how the
matches are computed vary by implementation.
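The softmax-style contrastive loss (5.58) for a single matching pair can be sketched as below. Since the exact set of terms in the denominator varies between implementations, this sketch assumes that the sum runs over all embeddings other than the anchor itself; the function name, the temperature value, and the example embeddings are also hypothetical.

```python
import numpy as np

def softmax_contrastive_pair_loss(V, i, j, tau=0.1):
    """l_ij = -log( exp(v_i.v_j / tau) / sum_{k != i} exp(v_i.v_k / tau) ), with j != i."""
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # normalize embeddings to unit length
    sims = V @ V[i] / tau                               # v_i . v_k / tau for every k
    others = np.arange(len(V)) != i                     # assumption: exclude the self-match k = i
    top = sims[others].max()
    log_denominator = top + np.log(np.exp(sims[others] - top).sum())
    return log_denominator - sims[j]

V = np.array([[ 1.0, 0.0],
              [ 0.9, 0.1],     # close to embedding 0
              [-1.0, 0.0],     # opposite to embedding 0
              [ 0.0, 1.0]])

loss_close = softmax_contrastive_pair_loss(V, 0, 1)    # small: the match is the nearest neighbor
loss_far = softmax_contrastive_pair_loss(V, 0, 2)      # large: the "match" is far away
```

Lowering `tau` sharpens the softmax, so a match that is not the nearest neighbor is penalized more heavily.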

This loss goes by several names, including InfoNCE (Oord, Li, and Vinyals 2018), and 
NT-Xent (normalized temperature cross-entropy loss) in Chen, Kornblith et al. (2020). Gen- 
eralized versions of this loss called SphereFace, CosFace, and ArcFace are discussed and 
compared in the ArcFace paper (Deng, Guo et al. 2019) and used by Bell, Liu et al. (2020) 
as part of their visual similarity search system. The smoothed average precision loss recently 
proposed by Brown, Xie et al. (2020) can sometimes be used as an alternative to the metric
losses discussed in this section. Some recent papers that compare and discuss deep metric
learning approaches include Jacob, Picard et al. (2019) and Musgrave, Belongie, and Lim (2020).


Weight initialization 


Before we can start optimizing the weights in our network, we must first initialize them. Early 
neural networks used small random weights to break the symmetry, i.e., to make sure that all 
of the gradients were not zero. It was observed, however, that in deeper layers, the activations 
would get progressively smaller. 

To maintain a comparable variance in the activations of successive layers, we must take 
into account the fan-in of each layer, i.e., the number of incoming connections where activa- 
tions get multiplied by weights. Glorot and Bengio (2010) did an initial analysis of this issue, 
and came up with a recommendation to set the random initial weight variance as the inverse 
of the fan-in. Their analysis, however, assumed a linear activation function (at least around 
the origin), such as a tanh function. 

Since most modern deep neural networks use the ReLU activation function (5.49), He, 
Zhang et al. (2015) updated this analysis to take into account this asymmetric non-linearity. 
If we initialize the weights to have zero mean and variance $V_l$ for layer $l$ and set the original
biases to zero, the linear summation in (5.45) will have a variance of

$$\mathrm{Var}[s_l] = n_l V_l E[x_l^2], \tag{5.59}$$

where $n_l$ is the number of incoming activations/weights and $E[x_l^2]$ is the expectation of the
squared incoming activations. When the summations $s_l$, which have zero mean, are fed
through the ReLU, the negative ones will get clamped to zero, so the expectation of the
squared output $E[y_l^2]$ is half the variance of $s_l$, $\mathrm{Var}[s_l]$.

In order to avoid decaying or increasing average activations in deeper layers, we want the
magnitude of the activations in successive layers to stay about the same. Since we have

$$E[y_l^2] = \frac{1}{2} \mathrm{Var}[s_l] = \frac{1}{2} n_l V_l E[x_l^2], \tag{5.60}$$

we conclude that the variance in the initial weights $V_l$ should be set to

$$V_l = \frac{2}{n_l}, \tag{5.61}$$

i.e., the inverse of half the fan-in of a given unit or layer. This weight initialization rule is
commonly called He initialization.
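The effect of the rule (5.61) can be checked numerically with a small experiment. The sketch below pushes random inputs through a stack of ReLU layers and compares the mean squared activation at the end with that at the start; the layer width, depth, batch size, and function name are arbitrary choices made for illustration.

```python
import numpy as np

def second_moment_ratio(variance_scale, n=512, depth=10, seed=0):
    """Ratio of mean squared activations after `depth` ReLU layers to that of the input.

    Weights are zero-mean Gaussian with variance V_l = variance_scale / n and biases are zero.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((1000, n))            # batch of random input activations
    start = np.mean(x ** 2)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(variance_scale / n)
        x = np.maximum(x @ W, 0.0)                # linear summation followed by ReLU
    return np.mean(x ** 2) / start

he_ratio = second_moment_ratio(2.0)       # V_l = 2 / n_l: magnitudes stay roughly constant
naive_ratio = second_moment_ratio(1.0)    # V_l = 1 / n_l: magnitudes roughly halve per layer
```

With the factor of 2 in place, the ratio stays on the order of one, whereas with variance $1/n_l$ it decays by about a factor of two per layer, which is the behavior predicted by (5.60).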

Neural network initialization continues to be an active research area, with publications 
that include Krähenbühl, Doersch et al. (2016), Mishkin and Matas (2016), Frankle and 
Carbin (2019), and Zhang, Dauphin, and Ma (2019).

5.3.5 Backpropagation 


Once we have set up our neural network by deciding on the number of layers, their widths 
and depths, added some regularization terms, defined the loss function, and initialized the 
weights, we are ready to train the network with our sample data. To do this, we use gradient 
descent or one of its variants to iteratively modify the weights until the network has converged 
to a good set of values, i.e., an acceptable level of performance on the training and validation 
data. 

To do this, we compute the derivatives (gradients) of the loss function $E_n$ for training 
sample n with respect to the weights w using the chain rule, starting with the outputs and 
working our way back through the network towards the inputs, as shown in Figure 5.31. This 
procedure is known as backpropagation (Rumelhart, Hinton, and Williams 1986b) and stands 
for backward propagation of errors. You can find alternative descriptions of this technique in 
textbooks and course notes on deep learning, including Bishop (2006, Section 5.3.1), Good- 
fellow, Bengio, and Courville (2016, Section 6.5), Glassner (2018, Chapter 18), Johnson 
(2020, Lecture 6), and Zhang, Lipton et al. (2021). 


Recall that in the forward (evaluation) pass of a neural network, activations (layer outputs) 
are computed layer-by-layer, starting with the first layer and finishing at the last. We will see 
in the next section that many newer DNNs have an acyclic graph structure, as shown in 
Figures 5.42-5.43, rather than just a single linear pipeline. In this case, any breadth-first 
traversal of the graph can be used. The reason for this evaluation order is computational 
efficiency. Activations need only be computed once for each input sample and can be re-used 
in succeeding stages of computation. 

During backpropagation, we perform a similar breadth-first traversal of the reverse graph. 
However, instead of computing activations, we compute derivatives of the loss with respect 
to the weights and inputs, which we call errors. Let us look at this in more detail, starting
with the loss function.

The derivative of the cross-entropy loss $E_n$ (5.54) with respect to the output probability
$p_{nk}$ is simply $-\delta_{k t_n}/p_{nk}$. What is more interesting is the derivative of the loss with respect
to the scores $s_{nk}$ going into the softmax layer (5.55) shown in Figure 5.27,

$$\frac{\partial E_n}{\partial s_{nk}} = -\delta_{k t_n} + \frac{1}{Z_n} \exp s_{nk} = p_{nk} - \delta_{k t_n} = p_{nk} - t_{nk}. \tag{5.62}$$

(The last form is useful if we are using one-hot encoding or the targets have non-binary
probabilities.) This has a satisfyingly intuitive explanation as the difference between the
predicted class probability $p_{nk}$ and the true class identity $t_{nk}$.
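The result that the derivative with respect to the scores is simply the predicted probability minus the one-hot target can be verified with a finite-difference check. The scores, target class, and step size below are arbitrary illustrative values.

```python
import numpy as np

def loss(scores, target):
    # E_n = log Z_n - s_{n, t_n}, computed stably.
    shifted = scores - scores.max()
    return np.log(np.exp(shifted).sum()) - shifted[target]

def analytic_grad(scores, target):
    # Softmax probabilities minus the one-hot encoding of the target class.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    one_hot = np.zeros_like(scores)
    one_hot[target] = 1.0
    return p - one_hot

scores = np.array([1.0, -0.5, 2.0, 0.3])
target = 2
eps = 1e-6

# Central differences, perturbing one score at a time.
numeric = np.array([
    (loss(scores + eps * np.eye(4)[k], target) -
     loss(scores - eps * np.eye(4)[k], target)) / (2 * eps)
    for k in range(4)
])
analytic = analytic_grad(scores, target)
```

The entries of the gradient sum to zero, since both the probabilities and the one-hot target sum to one.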
Figure 5.31 Backpropagating the derivatives (errors) through an intermediate layer of the
deep network © Glassner (2018). The derivatives of the loss function applied to a single
training example with respect to each of the pink unit inputs are summed together and the
process is repeated chaining backward through the network.


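For the $L_2$ loss in (5.56), we get a similar result,

$$\frac{\partial E_n}{\partial y_{nk}} = y_{nk} - t_{nk}, \tag{5.63}$$

which in this case denotes the real-valued difference between the predicted and target values
(the constant factor of 2 that arises from differentiating the square is omitted, as it can be
absorbed into the learning rate).

In the rest of this section, we drop the sample index $n$ from the activations $x_{in}$ and $y_{in}$,
since the derivatives for each sample $n$ can typically be computed independently from other
samples.$^{28}$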


To compute the partial derivatives of the loss term with respect to earlier weights and
activations, we work our way back through the network, as shown in Figure 5.31. Recall
from (5.45-5.46) that we first compute a weighted sum $s_i$ by taking a dot product between
the input activations $\mathbf{x}_i$ and the unit’s weight vector $\mathbf{w}_i$,

$$s_i = \mathbf{w}_i^T \mathbf{x}_i + b_i = \sum_j w_{ij} x_{ij} + b_i. \tag{5.64}$$

We then pass this weighted sum through an activation function $h$ to obtain $y_i = h(s_i)$.

28 This is not the case if batch or other kinds of normalization (Section 5.3.3) are being used. For batch
normalization, we have to accumulate the statistics across all the samples in the batch and then take their
derivatives with respect to each weight (Ioffe and Szegedy 2015). For instance and group norm, we compute
the statistics across all the pixels in a given channel or group, and then have to compute these additional
derivatives as well.

To compute the derivative of the loss $E_n$ with respect to the weights, bias, and input
activations, we use the chain rule,

$$e_i = \frac{\partial E_n}{\partial s_i} = h'(s_i) \frac{\partial E_n}{\partial y_i}, \tag{5.65}$$

$$\frac{\partial E_n}{\partial w_{ij}} = x_{ij} \frac{\partial E_n}{\partial s_i} = x_{ij} e_i, \tag{5.66}$$

$$\frac{\partial E_n}{\partial b_i} = \frac{\partial E_n}{\partial s_i} = e_i, \quad \text{and} \tag{5.67}$$

$$\frac{\partial E_n}{\partial x_{ij}} = w_{ij} \frac{\partial E_n}{\partial s_i} = w_{ij} e_i. \tag{5.68}$$


We call the term $e_i = \partial E_n / \partial s_i$, i.e., the partial derivative of the loss $E_n$ with respect to the
summed activation $s_i$, the error, as it gets propagated backward through the network.

Now, where do these errors come from, i.e., how do we obtain $\partial E_n / \partial y_i$? Recall from
Figure 5.24 that the outputs from one unit or layer become the inputs for the next layer. In
fact, for a simple network like the one in Figure 5.24, if we let $x_{ij}$ be the activation that unit
$i$ receives from unit $j$ (as opposed to just the $j$th input to unit $i$), we can simply set $x_{ij} = y_j$.

Since $y_i$, the output of unit $i$, now serves as input for the other units $k > i$ (assuming the
units are ordered breadth first), we have

$$\frac{\partial E_n}{\partial y_i} = \sum_{k>i} \frac{\partial E_n}{\partial x_{ki}} = \sum_{k>i} w_{ki} e_k \tag{5.69}$$

and

$$e_i = h'(s_i) \frac{\partial E_n}{\partial y_i} = h'(s_i) \sum_{k>i} w_{ki} e_k. \tag{5.70}$$


In other words, to compute a unit’s (backpropagation) error, we compute a weighted sum 
of the errors coming from the units it feeds into and then multiply this by the derivative of 
the current activation function h’(s;). This backward flow of errors is shown in Figure 5.31, 
where the errors for the three units in the shaded box are computed using weighted sums of 
the errors coming from later in the network. 
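The error recursion described above can be written out for a tiny two-layer network as a sketch. The network sizes, the ReLU hidden layer, the linear output layer, and the half squared-error loss (chosen so that the output error is exactly the prediction minus the target) are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                          # input activations
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)   # hidden layer (ReLU)
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)   # output layer (linear)
t = rng.standard_normal(3)                          # target values

def forward(W1, b1, W2, b2):
    s1 = W1 @ x + b1                 # weighted sums of the hidden units
    y1 = np.maximum(s1, 0.0)         # y = h(s) with h = ReLU
    s2 = W2 @ y1 + b2
    return s1, y1, s2                # the linear output is y2 = s2

def loss(W1, b1, W2, b2):
    return 0.5 * np.sum((forward(W1, b1, W2, b2)[2] - t) ** 2)

s1, y1, y2 = forward(W1, b1, W2, b2)

e2 = y2 - t                          # output errors (h' = 1 for a linear unit)
dW2 = np.outer(e2, y1)               # weight gradient: incoming activation times error
db2 = e2                             # bias gradient: the error itself

e1 = (s1 > 0) * (W2.T @ e2)          # h'(s_i) times the weighted sum of downstream errors
dW1 = np.outer(e1, x)
db1 = e1
```

Note that the backward pass reuses `s1` and `y1` from the forward pass, which is why implementations keep the activations in memory during training.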

This backpropagation rule has a very intuitive explanation. The error (derivative of the 
loss) for a given unit depends on the errors of the units that it feeds multiplied by the weights
that couple them together. This is a simple application of the chain rule. The slope of the
activation function $h'(s_i)$ modulates this interaction. If the unit’s output is clamped to zero or
small, e.g., with a negative-input ReLU or the “flat” part of a sigmoidal response, the unit’s
error is itself zero or small. The gradient of the weight, i.e., how much the weight should be
perturbed to reduce the loss, is a signed product of the incoming activation and the unit's error,
$x_{ij} e_i$. This is closely related to the Hebbian update rule (Hebb 1949), which observes that
synaptic efficiency in biological neurons increases with correlated firing in the presynaptic
and postsynaptic cells. An easier way to remember this rule is “neurons wire together if they
fire together” (Lowel and Singer 1992).

There are, of course, other computational elements in modern neural networks, including 
convolutions and pooling, which we cover in the next section. The derivatives and error 
propagation through such other units follow the same procedure as we sketched here, i.e., 
recursively apply the chain rule, taking analytic derivatives of the functions being applied, 
until you have the derivatives of the loss function with respect to all the parameters being 
optimized, i.e., the gradient of the loss. 

As you may have noticed, the computation of the gradients with respect to the weights 
requires the unit activations computed in the forward pass. A typical implementation of 
neural network training stores the activations for a given sample and uses these during the 
backprop (backward error propagation) stage to compute the weight derivatives. Modern 
neural networks, however, may have millions of units and hence activations (Figure 5.44). 
The number of activations that need to be stored can be reduced by only storing them at 
certain layers and then re-computing the rest as needed, which goes under the name gradient 
checkpointing (Griewank and Walther 2000; Chen, Xu et al. 2016; Bulatov 2018).$^{29}$ A more 
extensive review of low-memory training can be found in the technical report by Sohoni, 
Aberger et al. (2019). 


5.3.6 Training and optimization 


At this point, we have all of the elements needed to train a neural network. We have defined 
the network’s topology in terms of the sizes and depths of each layer, specified our activation 
functions, added regularization terms, specified our loss function, and initialized the weights. 
We have even described how to compute the gradients, i.e., the derivatives of the regularized 
loss with respect to all of our weights. What we need at this point is some algorithm to turn 
these gradients into weight updates that will optimize the loss function and produce a network 
that generalizes well to new, unseen data. 

In most computer vision algorithms such as optical flow (Section 9.1.3), 3D reconstruc- 
tion using bundle adjustment (Section 11.4.2), and even in smaller-scale machine learning 
problems such as logistic regression (Section 5.1.3), the method of choice is linearized least 
squares (Appendix A.3). The optimization is performed using a second-order method such 
as Gauss-Newton, in which we evaluate all of the terms in our loss function and then take an 
optimally-sized downhill step using a direction derived from the gradients and the Hessian of 
the energy function. 


29 This name seems a little weird, since it’s actually the activations that are saved instead of the gradients.

Unfortunately, deep learning problems are far too large (in terms of number of parameters 
and training samples; see Figure 5.44) to make this approach practical. Instead, practitioners 
have developed a series of optimization algorithms based on extensions to stochastic gradient 
descent (SGD) (Zhang, Lipton et al. 2021, Chapter 11). In SGD, instead of evaluating the 
loss function by summing over all the training samples, as in (5.54) or (5.56), we instead just 
evaluate a single training sample $n$ and compute the derivatives of the associated loss $E_n(\mathbf{w})$. 
We then take a tiny downhill step along the direction of this gradient. 

In practice, the directions obtained from just a single sample are incredibly noisy
estimates of a good descent direction, so the losses and gradients are usually summed over a
small subset of the training data,

$$E_{\mathcal{B}}(\mathbf{w}) = \sum_{n \in \mathcal{B}} E_n(\mathbf{w}), \tag{5.71}$$

where each subset $\mathcal{B}$ is called a minibatch. Before we start to train, we randomly assign the
training samples into a fixed set of minibatches, each of which has a fixed size that commonly
ranges from 32 at the low end to 8k at the higher end (Goyal, Dollár et al. 2017). The resulting
algorithm is called minibatch stochastic gradient descent, although in practice, most people
just call it SGD (omitting the reference to minibatches).$^{30}$

After evaluating the gradients $\mathbf{g} = \nabla_{\mathbf{w}} E_{\mathcal{B}}$ by summing over the samples in the minibatch,
it is time to update the weights. The simplest way to do this is to take a small step in the
negative gradient (downhill) direction,

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \mathbf{g} \quad \text{or} \tag{5.72}$$

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha_t \mathbf{g}_t, \tag{5.73}$$

where the first variant looks more like an assignment statement (see, e.g., Zhang, Lipton et
al. 2021, Chapter 11; Loshchilov and Hutter 2019), while the second makes the temporal
dependence explicit, using $t$ to denote each successive step in the gradient descent.$^{31}$

The step size parameter $\alpha$ is often called the learning rate and must be carefully adjusted
to ensure good progress while avoiding overshooting and exploding gradients. In practice,
it is common to start with a larger (but still small) learning rate $\alpha_t$ and to decrease it over 
time so that the optimization settles into a good minimum (Johnson 2020, Lecture 11; Zhang, 
Lipton et al. 2021, Chapter 11). 
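The minibatch update with a decaying learning rate can be sketched on a toy least-squares problem. The synthetic data, the batch size of 32, the number of epochs, and the particular decay schedule are hypothetical choices for illustration; real networks use the same loop with gradients obtained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3))            # 256 training inputs
w_true = np.array([1.5, -2.0, 0.5])
t = X @ w_true                                # noise-free regression targets

w = np.zeros(3)                               # initial weights
batch_size = 32
# Randomly assign the samples to a fixed set of minibatches before training.
batches = rng.permutation(len(X)).reshape(-1, batch_size)

for epoch in range(50):
    alpha = 0.005 / (1.0 + 0.05 * epoch)      # hypothetical learning rate decay schedule
    for B in batches:
        residual = X[B] @ w - t[B]
        g = 2.0 * X[B].T @ residual           # gradient of the summed squared error over the minibatch
        w = w - alpha * g                     # downhill step, as in the update rule above
```

Because each step only looks at 32 samples, an epoch costs the same as one full-batch gradient evaluation but performs eight weight updates.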


30 In the deep learning community, classic algorithms that sum over all the measurements are called batch gradient
descent, although this term is not widely used elsewhere, as it is assumed that using all measurements at once is the
preferred approach. In large-scale problems such as bundle adjustment, it’s possible that using minibatches may
result in better performance, but this has so far not been explored.

31 I use the index $k$ in discussing iterative algorithms in Appendix A.5.

Figure 5.32 Screenshot from http://playground.tensorflow.org, where you can build and
train your own small network in your web browser. Because the input space is
two-dimensional, you can visualize the responses to all 2D inputs at each unit in the network.


Regular gradient descent is prone to stalling when the current solution reaches a “flat 
spot” in the search space, and stochastic gradient descent only pays attention to the errors 
in the current minibatch. For these reasons, the SGD algorithms may use the concept of 
momentum, where an exponentially decaying (“leaky”) running average of the gradients is
accumulated and used as the update direction,

$$\mathbf{v}_{t+1} = \rho \mathbf{v}_t + \mathbf{g}_t \tag{5.74}$$

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha_t \mathbf{v}_t. \tag{5.75}$$

A relatively large value of $\rho \in [0.9, 0.99]$ is used to give the algorithm good memory,
effectively averaging gradients over more batches.$^{32}$
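A momentum update can be sketched in a few lines. This version applies the step with the freshly updated running average, which is a common convention and an assumption about the indexing rather than a statement of the exact form used above; the function name, learning rate, and test problem are also hypothetical.

```python
import numpy as np

def momentum_step(w, v, grad_fn, alpha=0.05, rho=0.9):
    g = grad_fn(w)
    v_next = rho * v + g              # leaky running sum of gradients
    w_next = w - alpha * v_next       # assumption: step along the updated velocity
    return w_next, v_next

# With a constant gradient, the velocity approaches g / (1 - rho).
g_const = np.array([1.0, -2.0])
v = np.zeros(2)
for _ in range(200):
    _, v = momentum_step(np.zeros(2), v, lambda w: g_const)

# Minimizing f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w, vel = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, vel = momentum_step(w, vel, lambda w: w)
```

The first loop illustrates why the learning rate has to be scaled down when momentum is added: for $\rho = 0.9$, the steady-state velocity is ten times the gradient.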

Over the last decade, a number of more sophisticated optimization techniques have been 
applied to deep network training, as described in more detail in Johnson (2020, Lecture 11) 
and Zhang, Lipton et al. (2021, Chapter 11). These algorithms include: 


• Nesterov momentum, where the gradient is (effectively) computed at the state predicted
from the velocity update;

• AdaGrad (Adaptive Gradient), where each component in the gradient is divided by
the square root of the per-component summed squared gradients (Duchi, Hazan, and
Singer 2011);

• RMSProp, where the running sum of squared gradients is replaced with a leaky
(decaying) sum (Hinton 2012);

• Adadelta, which augments RMSProp with a leaky sum of the actual per-component
changes in the parameters and uses these in the gradient re-scaling equation (Zeiler
2012);

• Adam, which combines elements of all the previous ideas into a single framework and
also de-biases the initial leaky estimates (Kingma and Ba 2015); and

• AdamW, which is Adam with decoupled weight decay (Loshchilov and Hutter 2019).

32 Note that a recursive formula such as (5.74), which is the same as a temporal infinite impulse response filter
(3.2.3), converges in the limit to a value of $\mathbf{g}/(1 - \rho)$, so $\alpha$ needs to be correspondingly adjusted.


Adam and AdamW are currently the most popular optimizers for deep networks, although 
even with all their sophistication, learning rates need to be set carefully (and probably decayed 
over time) to achieve good results. Setting the right hyperparameters, such as the learning 
rate initial value and decay rate, momentum terms such as $\rho$, and amount of regularization,
so that the network achieves good performance within a reasonable training time is itself an
open research area. The lecture notes by Johnson (2020, Lecture 11) provide some guidance,
although in many cases, people perform a search over hyperparameters to find which ones
produce the best performing network.
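As a sketch of how Adam combines a leaky average of the gradients, a leaky average of the squared gradients, and de-biasing of both, consider the update below. The default values for `beta1`, `beta2`, and `eps` follow those commonly quoted from Kingma and Ba (2015); the function name, the learning rate, and the toy quadratic problem are assumptions for illustration.

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for de-biasing."""
    m = beta1 * m + (1.0 - beta1) * g            # leaky average of gradients (momentum)
    v = beta2 * v + (1.0 - beta2) * g ** 2       # leaky average of squared gradients (as in RMSProp)
    m_hat = m / (1.0 - beta1 ** t)               # de-bias the zero-initialized estimates
    v_hat = v / (1.0 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w0 = np.array([1.0, -2.0])
w, m, v = w0.copy(), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    w, m, v = adam_step(w, w, m, v, t)
    if t == 1:
        w_first = w.copy()                       # state after the very first update
```

On the first step the de-biased ratio is approximately the sign of the gradient, so every component moves by about `alpha` regardless of the gradient's scale, which is part of why Adam is less sensitive to gradient magnitudes than plain SGD.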


A simple two-input example 


A great way to get some intuition on how deep networks update the weights and carve 
out a solution space during training is to play with the interactive visualization at
http://playground.tensorflow.org.$^{33}$ As shown in Figure 5.32, just click the “run” (>) button to get 
started, then reset the network to a new start (button to the left of run) and try single-stepping 
the network, using different numbers of units per hidden layer and different activation func- 
tions. Especially when using ReLUs, you can see how the network carves out different parts 
of the input space and then combines these sub-pieces together. Section 5.4.5 discusses visu- 
alization tools to get insights into the behavior of larger, deeper networks. 
Figure 5.33 Architecture of LeNet-5, a convolutional neural network for digit recognition
(LeCun, Bottou et al. 1998) © 1998 IEEE. This network uses multiple channels in each layer
and alternates multi-channel convolutions with downsampling operations, followed by some
fully connected layers that produce one activation for each of the 10 digits being classified.


5.4 Convolutional neural networks 


The previous sections on deep learning have covered all of the essential elements of con- 
structing and training deep networks. However, they have omitted what is likely the most 
crucial component of deep networks for image processing and computer vision, which is the 
use of trainable multi-layer convolutions. The idea of convolutional neural networks was 
popularized by LeCun, Bottou et al. (1998), where they introduced the LeNet-5 network for 
digit recognition shown in Figure 5.33.$^{34}$ 

Instead of connecting all of the units in a layer to all the units in a preceding layer, convo- 
lutional networks organize each layer into feature maps (LeCun, Bottou et al. 1998), which 
you can think of as parallel planes or channels, as shown in Figure 5.33. In a convolutional 
layer, the weighted sums are only performed within a small local window, and weights are 
identical for all pixels, just as in regular shift-invariant image convolution and correlation 
(3.12-3.15). 

Unlike image convolution, however, where the same filter is applied to each (color) channel,
neural network convolutions typically linearly combine the activations from each of the
$C_1$ input channels in a previous layer and use different convolution kernels for each of the
$C_2$ output channels, as shown in Figures 5.34-5.35.$^{35}$ This makes sense, as the main task in
convolutional neural network layers is to construct local features (Figure 3.40c) and to then
combine them in different ways to produce more discriminative and semantically meaningful
features.$^{36}$ Visualizations of the kinds of features that deep networks extract are shown in
Figure 5.47 in Section 5.4.5.

33 Additional informative interactive demonstrations can be found at
https://cs.stanford.edu/people/karpathy/convnetjs.

34 A similar convolutional architecture, but without the gradient descent training procedure, was earlier proposed
by Fukushima (1980).

35 The number of channels in a given network layer is sometimes called its depth, but the number of layers in a
deep network is also called its depth. So, be careful when reading network descriptions.

Figure 5.34 2D convolution with multiple input and output channels © Glassner (2018).
Each 2D convolution kernel takes as input all of the $C_1$ channels in the preceding layer,
windowed to a small area, and produces the values (after the activation function non-linearity)
in one of the $C_2$ channels in the next layer. For each of the output channels, we have $S^2 \times C_1$
kernel weights, so the total number of learnable parameters in each convolutional layer is
$S^2 \times C_1 \times C_2$. In this figure, we have $C_1 = 6$ input channels and $C_2 = 4$ output channels,
with an $S = 3$ convolution window, for a total of $9 \times 6 \times 4$ learnable weights, shown in the
middle column of the figure. Since the convolution is applied at each of the $W \times H$ pixels in
a given layer, the amount of computation (multiply-adds) in each forward and backward pass
over one sample in a given layer is $W H S^2 C_1 C_2$.

With these intuitions in place, we can write the weighted linear sums (5.45) performed in
a convolutional layer as

$$s(i, j, c_2) = \sum_{c_1 \in \{C_1\}} \sum_{(k,l) \in \mathcal{N}} w(k, l, c_1, c_2)\, x(i+k, j+l, c_1) + b(c_2), \tag{5.76}$$

where the $x(i, j, c_1)$ are the activations in the previous layer, just as in (5.45), $\mathcal{N}$ are the $S^2$
signed offsets in the 2D spatial kernel, and the notation $c_1 \in \{C_1\}$ denotes $c_1 \in [0, C_1)$. Note
that because the offsets $(k, l)$ are added to (instead of subtracted from) the $(i, j)$ pixel
coordinates, this operation is actually a correlation (3.13), but this distinction is usually glossed
over.$^{37}$

36 Note that pixels in different input and output channels (within the convolution window size) are fully connected,
unless grouped convolutions, discussed below, are used.

Figure 5.35 2D convolution with multiple batches, input, and output channels, © Johnson
(2020). When doing mini-batch gradient descent, a whole batch of training images or
features is passed into a convolutional layer, which takes as input all of the $C_{in}$ channels in
the preceding layer, windowed to a small area, and produces the values (after the activation
function non-linearity) in one of the $C_{out}$ channels in the next layer. As before, for each of the
output channels, we have $K_w \times K_h \times C_{in}$ kernel weights, so the total number of learnable
parameters in each convolutional layer is $K_w \times K_h \times C_{in} \times C_{out}$. In this figure, we have
$C_{in} = 3$ input channels and $C_{out} = 6$ output channels.


In neural network diagrams such as those shown in Figures 5.33 and 5.39-5.43, it is 
common to indicate the convolution kernel size S and the number of channels in a layer C, 
and only sometimes to show the image dimensions, as in Figures 5.33 and 5.39. Note that 
some neural networks such as the Inception module in GoogLeNet (Szegedy, Liu et al. 2015) 
shown in Figure 5.42 use $1 \times 1$ convolutions, which do not actually perform convolutions
but rather combine various channels on a per-pixel basis, often with the goal of reducing the
dimensionality of the feature space.


Because the weights in a convolution kernel are the same for all of the pixels within a
given layer and channel, these weights are shared across all of the connections that would
result if we drew every link between pixels in different layers. This means that
there are many fewer weights to learn than in fully connected layers. It also means that
during backpropagation, kernel weight updates are summed over all of the pixels in a given
layer/channel.
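
To get a feel for how much weight sharing saves, the following minimal sketch compares the number of kernel weights in a convolutional layer with the number of weights a fully connected layer of the same size would need. The function names, the 3 × 3 kernel, and the 32 × 32 image size are assumptions chosen only for illustration, and biases are ignored.

```python
def conv_weight_count(k_w, k_h, c_in, c_out):
    """Kernel weights in one convolutional layer: K_w * K_h * C_in * C_out."""
    return k_w * k_h * c_in * c_out


def dense_weight_count(width, height, c_in, c_out):
    """Weights if every input unit were connected to every output unit."""
    return (width * height * c_in) * (width * height * c_out)


# C_in = 3 and C_out = 6 as in Figure 5.35; the 3x3 kernel and the 32x32
# image size are assumptions made only for this illustration.
conv_weights = conv_weight_count(3, 3, 3, 6)
dense_weights = dense_weight_count(32, 32, 3, 6)
print(conv_weights)   # 162
print(dense_weights)  # 18874368
```

Note that the convolutional count does not depend on the image size at all, whereas the fully connected count grows with the square of the number of pixels.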


37 Since the weights in a neural network are learned, this reversal does not really matter.

To fully determine the behavior of a convolutional layer, we still need to specify a few
additional parameters.38 These include:


• Padding. Early networks such as LeNet-5 did not pad the image, which therefore
shrank after each convolution. Modern networks can optionally specify a padding
width and mode, using one of the choices used with traditional image processing, such
as zero padding or pixel replication, as shown in Figure 3.13.

• Stride. The default stride for convolution is 1 pixel, but it is also possible to only
evaluate the convolution at every nth column and row. For example, the first convolution
layer in AlexNet (Figure 5.39) uses a stride of 4. Traditional image pyramids
(Figure 3.31) use a stride of 2 when constructing the coarser levels.

• Dilation. Extra “space” (skipped rows and columns) can be inserted between pixel
samples during convolution, also known as dilated or à trous (with holes, in French, or
often just “atrous”) convolution (Yu and Koltun 2016; Chen, Papandreou et al. 2018).
While in principle this can lead to aliasing, it can also be effective at pooling over a
larger region while using fewer operations and learnable parameters.

• Grouping. While, by default, all input channels are used to produce each output channel,
we can also group the input and output layers into G separate groups, each of
which is convolved separately (Xie, Girshick et al. 2017). G = 1 corresponds to
regular convolution, while G = C means that each corresponding input channel is
convolved independently from the others, which is known as depthwise or channel-separated
convolution (Howard, Zhu et al. 2017; Tran, Wang et al. 2019).
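
The effect of these four parameters on layer size can be sketched with a little arithmetic. The helper names below are hypothetical, and the output-size expression follows the floor convention commonly used by deep learning frameworks, which is an assumption rather than something stated in the text above.

```python
def conv_output_size(n, k, padding=0, stride=1, dilation=1):
    """Output length along one axis of a convolution (floor convention)."""
    effective_k = dilation * (k - 1) + 1  # extent covered by the dilated kernel
    return (n + 2 * padding - effective_k) // stride + 1


def grouped_conv_weights(k_w, k_h, c_in, c_out, groups=1):
    """Kernel weights when channels are split into `groups` separate groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k_w * k_h * (c_in // groups) * c_out


print(conv_output_size(32, 5))                         # 28: no padding, image shrinks
print(conv_output_size(32, 3, padding=1))              # 32: padding keeps the size
print(conv_output_size(32, 3, padding=1, stride=2))    # 16: stride 2 halves it
print(conv_output_size(32, 3, padding=2, dilation=2))  # 32: dilated 3-tap kernel spans 5
print(grouped_conv_weights(3, 3, 64, 64, groups=1))    # 36864: regular convolution
print(grouped_conv_weights(3, 3, 64, 64, groups=64))   # 576: depthwise (G = C)
```

The last two lines show why grouping is attractive for efficiency: with G = C, each output channel only looks at a single input channel, so the weight count drops by a factor of C.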


A nice animation of the effects of these different parameters created by Vincent Dumoulin
can be found at https://github.com/vdumoulin/conv_arithmetic as well as in Dumoulin and Visin
(2016).

In certain applications such as image inpainting (Section 10.5.1), the input image may 
come with an associated binary mask, indicating which pixels are valid and which need to be 
filled in. This is similar to the concept of alpha-matted images we studied in Section 3.1.3. 
In this case, one can use partial convolutions (Liu, Reda et al. 2018), where the input pixels 
are multiplied by the mask pixels and then normalized by the count of non-zero mask pixels. 
The mask channel output is set to 1 if any input mask pixels are non-zero. This resembles 
the pull-push algorithm of Gortler, Grzeszczuk et al. (1996) that we presented in Figure 4.2,
except that the convolution weights are learned.
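
A minimal 1D sketch of a partial convolution, under stated assumptions, may help make this concrete. It uses a single channel, no bias, and rescales each masked sum by the ratio of window size to valid-sample count, which is one common way to do the normalization described above; the function name and the fixed kernel are illustrative only, since in a real network the weights would be learned.

```python
def partial_conv1d(x, mask, w):
    """Valid-mode 1D partial convolution (correlation form), no bias."""
    k = len(w)
    out, out_mask = [], []
    for i in range(len(x) - k + 1):
        valid = sum(mask[i:i + k])  # number of valid samples in the window
        if valid == 0:
            out.append(0.0)         # nothing known here: leave the hole
            out_mask.append(0)
            continue
        acc = sum(w[j] * x[i + j] * mask[i + j] for j in range(k))
        out.append(acc * k / valid)  # renormalize for the missing samples
        out_mask.append(1)           # valid if any input sample was valid
    return out, out_mask


x = [4.0, 2.0, 9.0, 4.0, 5.0]
mask = [1, 1, 0, 0, 0]               # 1 = valid sample, 0 = hole to fill in
y, y_mask = partial_conv1d(x, mask, [0.25, 0.5, 0.25])
print(y)       # [3.0, 1.5, 0.0]
print(y_mask)  # [1, 1, 0]
```

Applying such layers repeatedly shrinks the hole, since each output sample becomes valid as soon as its window touches any valid input.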


38 Most of the neural network building blocks we present in this chapter have corresponding functions in widely
used deep learning frameworks, where you can get more detailed information about their operation. For example,
the 2D convolution operator is called Conv2d in PyTorch and is documented at
https://pytorch.org/docs/stable/nn.html#convolution-layers.

A more sophisticated version of partial convolutions is gated convolutions (Yu, Lin et al.
2019; Chang, Liu et al. 2019), where the per-pixel masks are derived from the previous layer
using a learned convolution followed by a sigmoid non-linearity. This enables the network
not only to learn a better measure of per-pixel confidence (weighting), but also to incorporate
additional features such as user-drawn sketches or derived semantic information.
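
The gating idea can be sketched in 1D as follows. This is only an illustration under simplifying assumptions: a single channel, fixed rather than learned kernels, and no activation function on the feature branch; the helper names are hypothetical.

```python
import math


def conv1d(x, w, b=0.0):
    """Valid-mode 1D correlation with an optional bias."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) + b
            for i in range(len(x) - k + 1)]


def gated_conv1d(x, w_feat, w_gate, b_gate=0.0):
    """Feature response multiplied by a soft mask in (0, 1)."""
    feat = conv1d(x, w_feat)
    gate = [1.0 / (1.0 + math.exp(-g)) for g in conv1d(x, w_gate, b_gate)]
    return [f * g for f, g in zip(feat, gate)], gate


x = [1.0, 2.0, 3.0, 4.0]
out, gate = gated_conv1d(x, w_feat=[1.0, -1.0], w_gate=[0.0, 0.0])
print(gate)  # [0.5, 0.5, 0.5]: an all-zero gate kernel gives a neutral mask
print(out)   # [-0.5, -0.5, -0.5]
```

Unlike the binary mask of a partial convolution, the gate here is continuous and is computed from the same input as the features, so it can be trained along with the rest of the network.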


5.4.1 Pooling and unpooling 


As we just saw in the discussion of convolution, strides of greater than 1 can be used to reduce 
the resolution of a given layer, as in the first convolutional layer of AlexNet (Figure 5.39). 
When the weights inside the convolution kernel are identical and sum up to 1, this is called 
average pooling and is typically applied in a channel-wise manner. 

A widely used variant is to compute the maximum response within a square window, 
which is called max pooling. Common strides and window sizes for max pooling are a stride 
of 2 and 2 × 2 non-overlapping windows or 3 × 3 overlapping windows. Max pooling layers
can be thought of as a “logical or”, since they only require one of the units in the pooling 
region to be turned on. They are also supposed to provide some shift invariance over the 
inputs. However, most deep networks are not all that shift-invariant, which degrades their 
performance. The paper by Zhang (2019) has a nice discussion of this issue and some simple 
suggestions to mitigate this problem. 
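
Both kinds of pooling can be written as the same windowed loop with a different reduction. The sketch below is a plain-Python illustration for a single channel; the function names and the 4 × 4 test image are made up for this example.

```python
def pool2d(img, size, stride, reduce):
    """Apply `reduce` to each size x size window, stepping by `stride`."""
    h, w = len(img), len(img[0])
    return [[reduce([img[y + dy][x + dx]
                     for dy in range(size) for dx in range(size)])
             for x in range(0, w - size + 1, stride)]
            for y in range(0, h - size + 1, stride)]


def mean(values):
    return sum(values) / len(values)


img = [[1, 2, 5, 6],
       [3, 4, 7, 8],
       [9, 8, 1, 0],
       [7, 6, 3, 2]]
max_pooled = pool2d(img, 2, 2, max)
avg_pooled = pool2d(img, 2, 2, mean)
print(max_pooled)  # [[4, 8], [9, 3]]
print(avg_pooled)  # [[2.5, 6.5], [7.5, 1.5]]
```

The average-pooled result is exactly what a stride-2 convolution with four identical weights of 1/4 would produce, which is the equivalence mentioned at the start of this section.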

One issue that commonly comes up is how to backpropagate through a max pooling layer. 
The max pool operator acts like a “switch” that shunts (connects) one of the input units 
to the output unit. Therefore, during backpropagation, we only need to pass the error and 
derivatives down to this maximally active unit, as long as we have remembered which unit 
has this response. 

This same max unpooling mechanism can be used to create a “deconvolution network” 
when searching for the stimulus (Figure 5.47) that most strongly activates a particular unit 
(Zeiler and Fergus 2014). 
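
The “switch” bookkeeping can be sketched in 1D as follows. The forward pass records which input won each window, and the backward pass routes each output gradient to that input only. The function names are hypothetical, and the windows are assumed to be non-overlapping.

```python
def maxpool1d_forward(x, size):
    """Non-overlapping 1D max pooling that also records the winning indices."""
    out, switches = [], []
    for start in range(0, len(x) - size + 1, size):
        window = x[start:start + size]
        k = max(range(size), key=lambda j: window[j])  # position of the max
        out.append(window[k])
        switches.append(start + k)
    return out, switches


def maxpool1d_backward(grad_out, switches, n):
    """Route each output gradient to the input unit that produced the max."""
    grad_in = [0.0] * n
    for g, idx in zip(grad_out, switches):
        grad_in[idx] += g
    return grad_in


x = [1.0, 3.0, 2.0, 0.5, 4.0, 4.5]
pooled, switches = maxpool1d_forward(x, 2)
grad_in = maxpool1d_backward([0.1, 0.2, 0.3], switches, len(x))
print(pooled)    # [3.0, 2.0, 4.5]
print(switches)  # [1, 2, 5]
print(grad_in)   # [0.0, 0.1, 0.2, 0.0, 0.0, 0.3]
```

The backward function is also a max unpooling operator: fed with activations instead of gradients, it places each value back at its remembered location and fills the rest with zeros.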

If we want a more continuous behavior, we could construct a pooling unit that computes
an $L_p$ norm over its inputs, since the $L_{p \to \infty}$ norm effectively computes a maximum over its
components (Springenberg, Dosovitskiy et al. 2015). However, such a unit requires more
computation, so it is not widely used in practice, except sometimes at the final layer, where it
is known as generalized mean (GeM) pooling (Dollár, Tu et al. 2009; Tolias, Sicre, and Jégou
2016; Gordo, Almazán et al. 2017; Radenović, Tolias, and Chum 2019) or dynamic mean
(DAME) pooling (Yang, Kien Nguyen et al. 2019). In their paper, Springenberg, Dosovitskiy
et al. (2015) also show that using strided convolution instead of max pooling can produce
competitive results.
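
The way such a unit interpolates between average and max pooling can be checked numerically. The sketch below assumes the mean-normalized form $(\frac{1}{N}\sum_i v_i^p)^{1/p}$ over non-negative activations; the function name and the sample values are illustrative.

```python
def gem_pool(values, p):
    """Generalized mean of non-negative activations with exponent p."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


acts = [0.1, 0.5, 2.0, 1.0]
for p in (1, 3, 10, 100):
    # p = 1 is average pooling; the result climbs toward max(acts) = 2.0
    print(p, gem_pool(acts, p))
```

Because the exponent p can be treated as a (possibly learnable) parameter, a single unit of this kind can smoothly trade off between the behavior of average pooling and that of max pooling.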
Figure 5.36 Transposed convolution (© Dumoulin and Visin (2016)) can be used to upsample
(increase the size of) an image. Before applying the convolution operator, $(s - 1)$ extra
rows and columns of zeros are inserted between the input samples, where $s$ is the upsampling
stride.


While unpooling can be used to (approximately) reverse the effect of a max pooling
operation, if we want to reverse a convolutional layer, we can look at learned variants of the
interpolation operator we studied in Sections 3.5.1 and 3.5.3. The easiest way to visualize 
this operation is to add extra rows and columns of zeros between the pixels in the input layer, 
and to then run a regular convolution (Figure 5.36). This operation is sometimes called back- 
ward convolution with a fractional stride (Long, Shelhamer, and Darrell 2015), although it is 
more commonly known as transposed convolution (Dumoulin and Visin 2016), because when 
convolutions are written in matrix form, this operation is a multiplication with a transposed 
sparse weight matrix. Just as with regular convolution, padding, stride, dilation, and grouping 
parameters can be specified. However, in this case, the stride specifies the factor by which 
the image will be upsampled instead of downsampled. 
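
The zero-insertion view of transposed convolution can be sketched in 1D. The kernel here is fixed to a linear interpolation filter purely for illustration, whereas a network would learn it; the function names and the particular padding convention are assumptions.

```python
def upsample_zeros(x, s):
    """Insert (s - 1) zeros between consecutive samples."""
    out = []
    for i, v in enumerate(x):
        out.append(v)
        if i < len(x) - 1:
            out.extend([0.0] * (s - 1))
    return out


def conv1d_same(x, w):
    """Correlation with zero padding so the length is preserved (odd kernel)."""
    r = len(w) // 2
    padded = [0.0] * r + list(x) + [0.0] * r
    return [sum(w[j] * padded[i + j] for j in range(len(w)))
            for i in range(len(x))]


x = [1.0, 3.0, 2.0]
stuffed = upsample_zeros(x, 2)
up = conv1d_same(stuffed, [0.5, 1.0, 0.5])
print(stuffed)  # [1.0, 0.0, 3.0, 0.0, 2.0]
print(up)       # [1.0, 2.0, 3.0, 2.5, 2.0]
```

With this particular kernel, the original samples are preserved and the inserted zeros are replaced by the averages of their neighbors, i.e., the layer performs linear interpolation with an upsampling stride of 2.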


U-Nets and Feature Pyramid Networks 


When discussing the Laplacian pyramid in Section 3.5.3, we saw how image downsampling 
and upsampling can be combined to achieve a variety of multi-resolution image processing 
tasks (Figure 3.33). The same kinds of combinations can be used in deep convolutional 
networks, in particular, when we want the output to be a full-resolution image. Examples 
of such applications include pixel-wise semantic labeling (Section 6.4), image denoising and 
super-resolution (Section 10.3), monocular depth inference (Section 12.8), and neural style 
transfer (Section 14.6). The idea of reducing the resolution of a network and then expanding 
it again is sometimes called a bottleneck and is related to earlier self-supervised network 
training using autoencoders (Hinton and Zemel 1994; Goodfellow, Bengio, and Courville 
2016, Chapter 14). 


Figure 5.37 (a) The deconvolution network of Noh, Hong, and Han (2015) © 2015 IEEE
and (b-c) the U-Net of Ronneberger, Fischer, and Brox (2015), drawn using the PlotNeuralNet
LaTeX package. In addition to the fine-to-coarse-to-fine bottleneck used in (a), the U-Net
also has skip connections between encoding and decoding layers at the same resolution.

Figure 5.38 Screenshot from Andrej Karpathy’s web browser demos at
https://cs.stanford.edu/people/karpathy/convnetjs, where you can run a number of small neural
networks, including CNNs for digit and tiny image classification.

One of the earliest applications of this idea was the fully convolutional network developed
by Long, Shelhamer, and Darrell (2015). This paper inspired myriad follow-on architectures,
including the hourglass-shaped “deconvolution” network of Noh, Hong, and Han (2015),
the U-Net of Ronneberger, Fischer, and Brox (2015), the atrous convolution network with
CRF refinement layer of Chen, Papandreou et al. (2018), and the panoptic feature pyramid
networks of Kirillov, Girshick et al. (2019). Figure 5.37 shows the general layout of two of
these networks, which are discussed in more detail in Section 6.4 on semantic segmentation.
We will see other uses of these kinds of backbone networks (He, Gkioxari et al. 2017) in later
sections on image denoising and super-resolution (Section 10.3), monocular depth inference
(Section 12.8), and neural style transfer (Section 14.6).


5.4.2 Application: Digit classification 


One of the earliest commercial applications of convolutional neural networks was the LeNet-5
system created by LeCun, Bottou et al. (1998), whose architecture is shown in Figure 5.33.
This network contained most of the elements of modern CNNs, although it used sigmoid
non-linearities, average pooling, and Gaussian RBF units instead of softmax at its output. If
you want to experiment with this simple digit recognition CNN, you can visit the interactive
JavaScript demo created by Andrej Karpathy at
https://cs.stanford.edu/people/karpathy/convnetjs (Figure 5.38).

The network was initially deployed around 1995 by AT&T to automatically read checks
deposited in NCR ATM machines to verify that the written and keyed check amounts were
the same. The system was then incorporated into NCR’s high-speed check reading systems,
which at some point were processing somewhere between 10% and 20% of all the checks in
the US.39

Figure 5.39 Architecture of the SuperVision deep neural network (more commonly known
as “AlexNet”), courtesy of Matt Deitke (redrawn from (Krizhevsky, Sutskever, and Hinton
2012)). The network consists of multiple convolutional layers with ReLU activations, max
pooling, some fully connected layers, and a softmax to produce the final class probabilities.

Today, variants of the LeNet-5 architecture (Figure 5.33) are commonly used as the first 
convolutional neural network introduced in courses and tutorials on the subject. Although 
the MNIST dataset (LeCun, Cortes, and Burges 1998) originally used to train LeNet-5 is 
still sometimes used, it is more common to use the more challenging CIFAR-10 (Krizhevsky
2009) or Fashion MNIST (Xiao, Rasul, and Vollgraf 2017) as datasets for training and testing. 


5.4.3 Network architectures 


While modern convolutional neural networks were first developed and deployed in the late 
1990s, it was not until the breakthrough publication by Krizhevsky, Sutskever, and Hinton 
(2012) that they started outperforming more traditional techniques on natural image classi- 
fication (Figure 5.40). As you can see in this figure, the AlexNet system (the more widely 
used name for their SuperVision network) led to a dramatic drop in error rates from 25.8% to 
16.4%. This was rapidly followed in the next few years with additional dramatic performance 
improvements, due to further developments as well as the use of deeper networks, e.g., from 
the original 8-layer AlexNet to a 152-layer ResNet. 

Figure 5.39 shows the architecture of the SuperVision network, which contains a series
of convolutional layers with ReLU (rectified linear) non-linearities, max pooling, some fully
connected layers, and a final softmax layer, which is fed into a multi-class cross-entropy loss.
Krizhevsky, Sutskever, and Hinton (2012) also used dropout (Figure 5.29), small translation
and color manipulation for data augmentation, momentum, and weight decay ($L_2$ weight
penalties).

Figure 5.40 Top-5 error rate and network depths of winning entries from the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) © Li, Johnson, and Yeung (2019).

39 This information courtesy of Yann LeCun and Larry Jackel, who were two of the principals in the development
of this system.
40 See, e.g., https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html.

The next few years after the publication of this paper saw dramatic improvement in the 
classification performance on the ImageNet Large Scale Visual Recognition Challenge (Rus- 
sakovsky, Deng et al. 2015), as shown in Figure 5.40. A nice description of the innovations 
in these various networks, as well as their capacities and computational cost, can be found in 
the lecture slides by Justin Johnson (2020, Lecture 8). 


The winning entry from 2013 by Zeiler and Fergus (2014) used a larger version of AlexNet
with more channels in the convolution stages and lowered the error rate by about 30%. The
2014 Oxford Visual Geometry Group (VGG) winning entry by Simonyan and Zisserman
(2014b) used repeated 3 × 3 convolution/ReLU blocks interspersed with 2 × 2 max pooling
and channel doubling (Figure 5.41), followed by some fully connected layers, to produce
16-19 layer networks that further reduced the error by 40%. However, as shown in Figure 5.44,
this increased performance came at a greatly increased amount of computation.


Figure 5.41 The VGG16 network of Simonyan and Zisserman (2014b) © Glassner (2018).
(a) The network consists of repeated zero-pad, 3 × 3 convolution, ReLU blocks interspersed
with 2 × 2 max pooling and a doubling in the number of channels. This is followed by some
fully connected and dropout layers, with a final softmax into the 1,000 ImageNet categories.
(b) Some of the schematic neural network symbols used by Glassner (2018).

Figure 5.42 An Inception module from (Szegedy, Liu et al. 2015) © 2015 IEEE, which
combines dimensionality reduction, multiple convolution sizes, and max pooling as different
channels that get stacked together into a final feature map.

Figure 5.43 ResNet residual networks (He, Zhang et al. 2016a) © 2016 IEEE, showing
skip connections going around a series of convolutional layers. The figure on the right uses
a bottleneck to reduce the number of channels before the convolution. Having direct
connections that shortcut the convolutional layer allows gradients to more easily flow backward
through the network during training.

The 2015 GoogLeNet of Szegedy, Liu et al. (2015) focused instead on efficiency. GoogLeNet
begins with an aggressive stem network that uses a series of strided and regular
convolutions and max pool layers to quickly reduce the image resolutions from $224^2$ to $28^2$. It
then uses a number of Inception modules (Figure 5.42), each of which is a small branching
neural network whose features get concatenated at the end. One of the important
characteristics of this module is that it uses 1 × 1 “bottleneck” convolutions to reduce the number of
channels before performing larger 3 × 3 and 5 × 5 convolutions, thereby saving a significant
amount of computation. This kind of projection followed by an additional convolution is
similar in spirit to the approximation of filters as a sum of separable convolutions proposed
by Perona (1995). GoogLeNet also removed the fully connected (MLP) layers at the end,
relying instead on global average pooling followed by one linear layer before the softmax.
Its performance was similar to that of VGG but at dramatically lower computation and model
size costs (Figure 5.44).
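
The savings from a 1 × 1 bottleneck can be estimated by counting multiply-adds. In the sketch below, the 28 × 28 resolution matches the stem output mentioned above, but the channel counts (192 in, 16 in the bottleneck, 32 out) are illustrative assumptions, as is the helper name.

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-adds for a k x k convolution with size-preserving padding, stride 1."""
    return h * w * k * k * c_in * c_out


H = W = 28
# Direct 5x5 convolution from 192 to 32 channels (assumed channel counts).
direct = conv_macs(H, W, 5, 192, 32)
# 1x1 bottleneck down to 16 channels, followed by the 5x5 convolution.
bottleneck = conv_macs(H, W, 1, 192, 16) + conv_macs(H, W, 5, 16, 32)
print(direct)      # 120422400
print(bottleneck)  # 12443648
```

Under these assumptions, inserting the bottleneck cuts the cost of this branch by roughly a factor of ten, because the expensive 5 × 5 kernel now operates on far fewer input channels.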

The following year saw the introduction of Residual Networks (He, Zhang et al. 2016a), 
which dramatically expanded the number of layers that could be successfully trained (Fig- 
ure 5.40). The main technical innovation was the introduction of skip connections (originally 
called “shortcut connections”), which allow information (and gradients) to flow around a set 
of convolutional layers, as shown in Figure 5.43. The networks are called residual networks 
because they allow the network to learn the residuals (differences) between a set of incom- 
ing and outgoing activations. A variant on the basic residual block is the “bottleneck block” 
shown on the right side of Figure 5.43, which reduces the number of channels before performing
the 3 × 3 convolutional layer. A further extension, described in (He, Zhang et al.
2016b), moves the ReLU non-linearity to before the residual summation, thereby allowing 
true identity mappings to be modeled at no cost. 
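
The difference between the two block designs can be illustrated with a toy sketch in which small dense matrices stand in for the convolutional layers and batch normalization is omitted; both simplifications, as well as the function names, are assumptions made for brevity.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(v, W):
    """Matrix-vector product; a stand-in for a convolutional layer."""
    return [sum(wij * a for wij, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Original form: ReLU is applied after the residual summation."""
    f = linear(relu(linear(x, W1)), W2)
    return relu([a + b for a, b in zip(x, f)])


def preact_residual_block(x, W1, W2):
    """Later form: ReLUs come before the summation, so the skip path is untouched."""
    f = linear(relu(linear(relu(x), W1)), W2)
    return [a + b for a, b in zip(x, f)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]             # residual branch that outputs nothing
print(residual_block(x, zeros, zeros))         # [1.0, 0.0]: negative value clipped
print(preact_residual_block(x, zeros, zeros))  # [1.0, -2.0]: true identity mapping
```

With the residual branch set to zero, only the second variant passes its input through unchanged, which is the sense in which it can model a true identity mapping.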

To build a ResNet, various residual blocks are interspersed with strided convolutions and
channel doubling to achieve the desired number of layers. (Similar downsampling stems
and average pooled softmax layers as in GoogLeNet are used at the beginning and end.) By
combining various numbers of residual blocks, ResNets consisting of 18, 34, 50, 101, and
152 layers have been constructed and evaluated. The deeper networks have higher accuracy
but more computational cost (Figure 5.44). In 2015, ResNet not only took first place in
the ILSVRC (ImageNet) classification, detection, and localization challenges, but also took
first place in the detection and segmentation challenges on the newer COCO dataset and
benchmark (Lin, Maire et al. 2014).


Since then, myriad extensions and variants have been constructed and evaluated. The 
ResNeXt system from Xie, Girshick et al. (2017) used grouped convolutions to slightly improve
accuracy. Densely connected CNNs (Huang, Liu et al. 2017) added skip connections
between non-adjacent convolution and/or pool blocks. Finally, the Squeeze-and-Excitation 
network (SENet) by Hu, Shen, and Sun (2018) added global context (via global pooling) to 
each layer to obtain a noticeable increase in accuracy. More information about these and 
other CNN architectures can be found in both the original papers as well as class notes on 
this topic (Li, Johnson, and Yeung 2019; Johnson 2020). 


Mobile networks 


As deep neural networks were getting deeper and larger, a countervailing trend emerged in 
the construction of smaller, less computationally expensive networks that could be used in 
mobile and embedded applications. One of the earliest networks tailored for lighter-weight 
execution was MobileNets (Howard, Zhu et al. 2017), which used depthwise convolutions, 
a special case of grouped convolutions where the number of groups equals the number of 
channels. By varying two hyperparameters, namely a width multiplier and a resolution mul- 
tiplier, the network architecture could be tuned along an accuracy vs. size vs. computational 
efficiency tradeoff. The follow-on MobileNetV2 system (Sandler, Howard et al. 2018) added 
an “inverted residual structure”, where the shortcut connections were between the bottleneck 
layers. ShuffleNet (Zhang, Zhou et al. 2018) added a “shuffle” stage between grouped con- 
volutions to enable channels in different groups to co-mingle. ShuffleNet V2 (Ma, Zhang et 
al. 2018) added a channel split operator and tuned the network architectures using end-to-end 
performance measures. Two additional networks designed for computational efficiency are 
ESPNet (Mehta, Rastegari et al. 2018) and ESPNetv2 (Mehta, Rastegari et al. 2019), which 
use pyramids of (depth-wise) dilated separable convolutions.
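
To see why depthwise convolutions are attractive for mobile networks, the sketch below counts weights for a regular convolution and for a depthwise convolution followed by a 1 × 1 (pointwise) convolution that mixes the channels. The pairing with a pointwise layer follows the MobileNets design; the channel counts and function names are illustrative assumptions.

```python
def regular_conv_weights(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k, c_in, c_out):
    """Depthwise k x k (one filter per channel) plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


regular = regular_conv_weights(3, 128, 256)
separable = depthwise_separable_weights(3, 128, 256)
print(regular)    # 294912
print(separable)  # 33920
```

For a 3 × 3 kernel, the separable version needs between 8 and 9 times fewer weights in this example, and the same ratio applies to the number of multiply-adds at a fixed resolution.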


The concepts of grouped, depthwise, and channel-separated convolutions continue to be a 
widely used tool for managing computational efficiency and model size (Choudhary, Mishra 
et al. 2020), not only in mobile networks, but also in video classification (Tran, Wang et al. 
2019), which we discuss in more detail in Section 5.5.2. 
Figure 5.44 Network accuracy vs. size and operation counts (Canziani, Culurciello, and
Paszke 2017) © 2017 IEEE: In the right figure, the network accuracy is plotted against
operation count (1-40 G-Ops), while the size of the circle indicates the number of parameters
(10-155 M). The initials BN indicate a batch normalized version of a network.


5.4.4 Model zoos 


A great way to experiment with these various CNN architectures is to download pre-trained
models from a model zoo41 such as the TorchVision library at https://github.com/pytorch/vision.
If you look in the torchvision/models folder, you will find implementations of AlexNet,
VGG, GoogLeNet, Inception, ResNet, DenseNet, MobileNet, and ShuffleNet, along with
other models for classification, object detection, and image segmentation. Even more recent
models, some of which are discussed in the upcoming sections, can be found in the PyTorch
Image Models library (timm), https://github.com/rwightman/pytorch-image-models. Similar
collections of pre-trained models exist for other languages, e.g.,
https://www.tensorflow.org/lite/models for efficient (mobile) TensorFlow models.

While people often download and use pre-trained neural networks for their applications, 
it is more common to at least fine-tune such networks on data more characteristic of the ap- 
plication (as opposed to the public benchmark data on which most zoo models are trained). 
It is also quite common to replace the last few layers, i.e., the head of the network (so called 
because it lies at the top of a layer diagram when the layers are stacked bottom-to-top) while 
leaving the backbone intact. The terms backbone and head(s) are widely used and were
popularized by the Mask-RCNN paper (He, Gkioxari et al. 2017). Some more recent papers
refer to the backbone and head as the trunk and its branches (Ding and Tao 2018; Kirillov,
Girshick et al. 2019; Bell, Liu et al. 2020), with the term neck also being occasionally used
(Chen, Wang et al. 2019).43

41 The name “model zoo” itself is a fanciful invention of Evan Shelhamer, lead developer on Caffe (Jia, Shelhamer
et al. 2014), who first used it on https://caffe.berkeleyvision.org/model_zoo.html to describe a collection of various
pre-trained DNN models (personal communication).

42 See, e.g., https://classyvision.ai/tutorials/fine_tuning and (Zhang, Lipton et al. 2021, Section 13.2).

When adding a new head, its parameters can be trained using the new data specific to the 
intended application. Depending on the amount and quality of new training data available, 
the head can be as simple as a linear model such as an SVM or logistic regression/softmax 
(Donahue, Jia et al. 2014; Sharif Razavian, Azizpour et al. 2014), or as complex as a fully 
connected or convolutional network (Xiao, Liu et al. 2018). Fine-tuning some of the layers in 
the backbone is also an option, but requires sufficient data and a slower learning rate so that 
the benefits of the pre-training are not lost. The process of pre-training a machine learning 
system on one dataset and then applying it to another domain is called transfer learning 
(Pan and Yang 2009). We will take a closer look at transfer learning in Section 5.4.7 on
self-supervised learning.
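
The simplest version of this recipe, a frozen backbone with a newly trained linear head, can be sketched without any deep learning library. Everything here is a toy stand-in: the `backbone` function is a hypothetical fixed feature extractor rather than a real pre-trained network, and the head is a logistic regression fit with plain stochastic gradient descent.

```python
import math


def backbone(x):
    """Frozen, 'pre-trained' feature extractor (a hypothetical stand-in)."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head with SGD; the backbone is never updated."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


# Toy application data: the label is 1 when x0 + x1 > 0.
inputs = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.0), (-2.0, 0.5), (0.5, 1.0), (-0.5, -2.0)]
labels = [1, 1, 0, 0, 1, 0]

features = [backbone(x) for x in inputs]  # computed once, since the backbone is frozen
w, b = train_head(features, labels)
predictions = [int(sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) > 0.5)
               for f in features]
print(predictions == labels)  # True
```

Because the backbone is frozen, its features only need to be computed once, which is part of what makes training a new head so much cheaper than fine-tuning the whole network.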


Model size and efficiency 


As you can tell from the previous discussion, neural network models come in a large variety 
of sizes (typically measured in number of parameters, i.e., weights and biases) and com- 
putational loads (often measured in FLOPs per forward inference pass). The evaluation by 
Canziani, Culurciello, and Paszke (2017), summarized in Figure 5.44, gives a snapshot of the 
performance (accuracy and size+operations) of the top-performing networks on the ImageNet 
challenge from 2012-2017. In addition to the networks we have already discussed, the study 
includes Inception-v3 (Szegedy, Vanhoucke et al. 2016) and Inception-v4 (Szegedy, Ioffe et 
al. 2017). 

Because deep neural networks can be so memory- and compute-intensive, a number of 
researchers have investigated methods to reduce both, using lower precision (e.g., fixed-point) 
arithmetic and weight compression (Han, Mao, and Dally 2016; Iandola, Han et al. 2016). 
The XNOR-Net paper by Rastegari, Ordonez et al. (2016) investigates using binary weights 
(on-off connections) and optionally binary activations. It also has a nice review of previous 
binary networks and other compression techniques, as do more recent survey papers (Sze, 
Chen et al. 2017; Gu, Wang et al. 2018; Choudhary, Mishra et al. 2020). 
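
As a rough illustration of binary weights, the sketch below approximates a real-valued weight vector by a single scale factor times a vector of signs, with the scale set to the mean absolute weight. This follows the general form of binary-weight approximations such as the one in XNOR-Net, but the function name and the sample values are made up for this example.

```python
def binarize(weights):
    """Approximate weights by alpha * sign(w), with alpha the mean |w|."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


w = [0.5, -0.25, 0.75, -0.5]
alpha, signs = binarize(w)
print(alpha)  # 0.5
print(signs)  # [1, -1, 1, -1]

x = [1.0, 2.0, 3.0, 4.0]
exact = sum(wi * xi for wi, xi in zip(w, x))
approx = alpha * sum(si * xi for si, xi in zip(signs, x))  # only adds and subtracts
print(exact, approx)  # 0.25 -1.0
```

Each weight now needs one bit instead of 32, and the inner product reduces to additions and subtractions followed by one multiplication, at the price of an approximation error that training has to absorb.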


Neural Architecture Search (NAS) 


One of the most recent trends in neural network design is the use of Neural Architecture
Search (NAS) algorithms to try different network topologies and parameterizations (Zoph
and Le 2017; Zoph, Vasudevan et al. 2018; Liu, Zoph et al. 2018; Pham, Guan et al. 2018;
Liu, Simonyan, and Yang 2019; Hutter, Kotthoff, and Vanschoren 2019). This process is
more efficient (in terms of a researcher’s time) than the trial-and-error approach that
characterized earlier network design. Elsken, Metzen, and Hutter (2019) survey these and additional
papers on this rapidly evolving topic. More recent publications include FBNet (Wu, Dai et
al. 2019), RandomNets (Xie, Kirillov et al. 2019), EfficientNet (Tan and Le 2019), RegNet
(Radosavovic, Kosaraju et al. 2020), FBNetV2 (Wan, Dai et al. 2020), and EfficientNetV2
(Tan and Le 2021). It is also possible to do unsupervised neural architecture search (Liu,
Dollár et al. 2020). Figure 5.45 shows the top-1 accuracy on ImageNet vs. the network
size (# of parameters) and forward inference operation counts for a number of recent network
architectures (Wan, Dai et al. 2020). Compared to the earlier networks shown in Figure 5.44,
the newer networks use 10× (or more) fewer parameters.

Figure 5.45 ImageNet accuracy vs. (a) size (# of parameters) and (b) operation counts for
a number of recent efficient networks (Wan, Dai et al. 2020) © 2020 IEEE.

43 Classy Vision uses a trunk and heads terminology, https://classyvision.ai/tutorials/classy_model.


Deep learning software 


Over the last decade, a large number of deep learning software frameworks and programming
language extensions have been developed. The Wikipedia entry on deep learning software
lists over twenty such frameworks, about a half of which are still being actively developed.44
While Caffe (Jia, Shelhamer et al. 2014) was one of the first to be developed and used for
computer vision applications, it has mostly been supplanted by PyTorch and TensorFlow, at
least if we judge by the open-source implementations that now accompany most computer
vision research papers.

44 https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software

Figure 5.46 A Hinton diagram showing the weights connecting the units in a three-layer
neural network, courtesy of Geoffrey Hinton. The size of each small box indicates the
magnitude of each weight and its color (black or white) indicates the sign.


Andrew Glassner’s (2018) introductory deep learning book uses the Keras library because 
of its simplicity. The Dive into Deep Learning book (Zhang, Lipton et al. 2021) and associ- 
ated course (Smola and Li 2019) use MXNet for all the examples in the text, but they have 
recently released PyTorch and TensorFlow code samples as well. Stanford’s CS231n (Li, 
Johnson, and Yeung 2019) and Johnson (2020) include a lecture on the fundamentals of PyTorch
and TensorFlow. Some classes also use simplified frameworks that require the students
to implement more components, such as the Educational Framework (EDF) developed by 
McAllester (2020) and used in Geiger (2021). 

In addition to software frameworks and libraries, deep learning code development usually
benefits from good visualization libraries such as TensorBoard45 and Visdom.46 And in addition
to the model zoos mentioned earlier in this section, there are even higher-level packages
such as Classy Vision,47 which allow you to train or fine-tune your own classifier with no
or minimal programming. Andrej Karpathy also provides a useful guide for training neural
networks at http://karpathy.github.io/2019/04/25/recipe, which may help avoid common
issues.


5.4.5 Visualizing weights and activations 


Visualizing intermediate and final results has always been an integral part of computer vision
algorithm development (e.g., Figures 1.7–1.11) and is an excellent way to develop intuitions
and debug or refine results. In this chapter, we have already seen examples of tools for simple
two-input neural network visualizations, e.g., the TensorFlow Playground in Figure 5.32 and
ConvNetJS in Figure 5.38. In this section, we discuss tools for visualizing network weights
and, more importantly, the response functions of different units or layers in a network.

45 https://www.tensorflow.org/tensorboard
46 https://github.com/fossasia/visdom
47 https://classyvision.ai

For a simple small network such as the one shown in Figure 5.32, we can indicate the 
strengths of connections using line widths and colors. What about networks with more units? 
A clever way to do this, called Hinton diagrams in honor of its inventor, is to indicate the 
strengths of the incoming and outgoing weights as black or white boxes of different sizes, as 
shown in Figure 5.46 (Ackley, Hinton, and Sejnowski 1985; Rumelhart, Hinton, and Williams 
1986b).48

If we wish to display the set of activations in a given layer, e.g., the response of the 
final 10-category layer in MNIST or CIFAR-10, across some or all of the inputs, we can 
use non-linear dimensionality reduction techniques such as t-SNE and UMAP discussed in
Section 5.2.4 and Figure 5.21. 

How can we visualize what individual units (“neurons”) in a deep network respond to? 
For the first layer in a network (Figure 5.47, upper left corner), the response can be read 
directly from the incoming weights (grayish images) for a given channel. We can also find 
the patches in the validation set that produce the largest responses across the units in a given 
channel (colorful patches in the upper left corner of Figure 5.47). (Remember that in a con- 
volutional neural network, different units in a particular channel respond similarly to shifted 
versions of the input, ignoring boundary and aliasing effects.) 

For deeper layers in a network, we can again find maximally activating patches in the 
input images. Once these are found, Zeiler and Fergus (2014) pair a deconvolution network 
with the original network to backpropagate feature activations all the way back to the image 
patch, which results in the grayish images in layers 2-5 in Figure 5.47. A related technique 
called guided backpropagation developed by Springenberg, Dosovitskiy et al. (2015) pro- 
duces slightly higher contrast results. 

Another way to probe a CNN feature map is to determine how strongly parts of an input 
image activate units in a given channel. Zeiler and Fergus (2014) do this by masking sub- 
regions of the input image with a gray square, which not only produces activation maps, but 
can also show the most likely labels associated with each image region (Figure 5.48). Si- 
monyan, Vedaldi, and Zisserman (2013) describe a related technique they call saliency maps, 
Nguyen, Yosinski, and Clune (2016) call their related technique activation maximization, and 
Selvaraju, Cogswell et al. (2017) call their visualization technique gradient-weighted class
activation mapping (Grad-CAM).
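
The occlusion idea can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: `toy_score` is a hypothetical stand-in for the response of a network unit, the image is a small list of lists, and the gray square is moved in non-overlapping steps rather than slid densely as a real implementation would.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.5):
    """Cover one patch at a time with a gray value and record the drop in score_fn."""
    h, w = len(image), len(image[0])
    base = score_fn(image)
    heat = [[0.0] * w for _ in range(h)]
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = [row[:] for row in image]
            for y in range(top, min(top + patch, h)):
                for x in range(left, min(left + patch, w)):
                    occluded[y][x] = fill
            drop = base - score_fn(occluded)
            # Every pixel under the square gets the same sensitivity value.
            for y in range(top, min(top + patch, h)):
                for x in range(left, min(left + patch, w)):
                    heat[y][x] = drop
    return heat


def toy_score(img):
    """Hypothetical unit response: the sum of the top-left 2x2 block."""
    return sum(img[y][x] for y in range(2) for x in range(2))


image = [[1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
heat = occlusion_map(image, toy_score)
```

Only the top-left block produces a non-zero entry in `heat`, since that is the only region the toy unit looks at.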


48 In the early days of neural networks, bit-mapped displays and printers only supported 1-bit black and white
images.


Figure 5.47 Visualizing network weights and features (Zeiler and Fergus 2014) © 2014
Springer. Each visualized convolutional layer is taken from a network adapted from the SuperVision
net of Krizhevsky, Sutskever, and Hinton (2012). The 3 × 3 subimages denote the
top nine responses in one feature map (channel in a given layer) projected back into pixel
space (higher layers project to larger pixel patches), with the color images on the right showing
the most responsive image patches from the validation set, and the grayish signed images
on the left showing the corresponding maximum stimulus pre-images.

Figure 5.48 Heat map visualization from Zeiler and Fergus (2014) © 2014 Springer. By
covering up portions of the input image with a small gray square, the response of a highly ac- 
tive channel in layer 5 can be visualized (second column), as can the feature map projections 
(third column), the likelihood of the correct class, and the most likely class per pixel. 


Figure 5.49 Feature visualization of how GoogLeNet (Szegedy, Liu et al. 2015) trained 
on ImageNet builds up its representations over different layers, from Olah, Mordvintsev, and 
Schubert (2017).

Figure 5.50 Examples of adversarial images © Szegedy, Zaremba et al. (2013). For each
original image in the left column, a small random perturbation (shown magnified by 10×
in the middle column) is added to obtain the image in the right column, which is always
classified as an ostrich.


Many more techniques for visualizing neural network responses and behaviors have been 
described in various papers and blogs (Mordvintsev, Olah, and Tyka 2015; Zhou, Khosla et 
al. 2015; Nguyen, Yosinski, and Clune 2016; Bau, Zhou et al. 2017; Olah, Mordvintsev, and 
Schubert 2017; Olah, Satyanarayan et al. 2018; Cammarata, Carter et al. 2020), as well as 
the extensive lecture slides by Johnson (2020, Lecture 14). Figure 5.49 shows one example, 
visualizing different layers in a pre-trained GoogLeNet. OpenAI also recently released a
great interactive tool called Microscope,⁴⁹ which allows people to visualize the significance
of every neuron in many common neural networks. 


5.4.6 Adversarial examples 


While techniques such as guided backpropagation can help us better visualize neural network 
responses, they can also be used to “trick” deep networks into misclassifying inputs by subtly 
perturbing them, as shown in Figure 5.50. The key to creating such images is to take a set 
of final activations and to then backpropagate a gradient in the direction of the “fake” class, 
updating the input image until the fake class becomes the dominant activation. Szegedy, 
Zaremba et al. (2013) call such perturbed images adversarial examples. 
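
As a rough illustration of this procedure, the sketch below runs gradient ascent on the log-probability of a chosen "fake" class with respect to the input. It assumes a tiny linear softmax classifier (the weights `W`, `b` and the step size are made up for the example), so the gradient can be written in closed form; with a deep network, the same gradient would be obtained by backpropagation.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, b, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def targeted_attack(W, b, x, target, step=0.05, max_iters=200):
    """Nudge the input x until `target` becomes the most probable class."""
    x = list(x)
    for it in range(max_iters):
        p = softmax(logits(W, b, x))
        if p.index(max(p)) == target:
            return x, it
        # For a linear softmax model: d log p_target / dx = W[target] - sum_k p_k W[k]
        grad = [W[target][j] - sum(p[k] * W[k][j] for k in range(len(W)))
                for j in range(len(x))]
        x = [xj + step * gj for xj, gj in zip(x, grad)]
    return x, max_iters


# Hypothetical two-class, two-pixel "network".
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x0 = [1.0, 0.5]                      # originally classified as class 0
x_adv, iters = targeted_attack(W, b, x0, target=1)
```

The loop stops as soon as the target class wins, which keeps the perturbation `x_adv - x0` small.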

Running this backpropagation requires access to the network and its weights, which
means that this is a white box attack, as opposed to a black box attack, where nothing is
known about the network. Surprisingly, however, the authors find “... that adversarial examples
are relatively robust, and are shared by neural networks with varied number of layers,
activations or trained on different subsets of the training data.”

49 https://microscope.openai.com/models

The initial discovery of adversarial attacks spurred a flurry of additional investigations 
(Goodfellow, Shlens, and Szegedy 2015; Nguyen, Yosinski, and Clune 2015; Kurakin, Goodfellow,
and Bengio 2016; Moosavi-Dezfooli, Fawzi, and Frossard 2016; Goodfellow, Papernot
et al. 2017). Eykholt, Evtimov et al. (2018) show how adding simple stickers to real-world
objects (such as stop signs) can cause neural networks to misclassify photographs of
such objects. Hendrycks, Zhao et al. (2021) have created a database of natural images that 
consistently fool popular deep classification networks trained on ImageNet. And while ad- 
versarial examples are mostly used to demonstrate the weaknesses of deep learning models, 
they can also be used to improve recognition (Xie, Tan et al. 2020). 

Ilyas, Santurkar et al. (2019) try to demystify adversarial examples, finding that instead 
of making the anticipated large-scale perturbations that affect a human label, they are performing
a type of shortcut learning (Lapuschkin, Wäldchen et al. 2019; Geirhos, Jacobsen
et al. 2020). They find that optimizers are exploiting the non-robust features for an image 
label; that is, non-random correlations for an image class that exist in the dataset, but are not 
easily detected by humans. These non-robust features look merely like noise to a human ob- 
server, leaving images perturbed by them predominantly the same. Their claim is supported 
by training classifiers solely on non-robust features and finding that they correlate with image 
classification performance. 

Are there ways to guard against adversarial attacks? The cleverhans software library 
(Papernot, Faghri et al. 2018) provides implementations of adversarial example construction 
techniques and adversarial training. There’s also an associated http://www.cleverhans.io blog 
on security and privacy in machine learning. Madry, Makelov et al. (2018) show how to train 
a network that is robust to bounded additive perturbations in known test images. There’s also 
recent work on detecting (Qin, Frosst et al. 2020b) and deflecting adversarial attacks (Qin, 
Frosst et al. 2020a) by forcing the perturbed images to visually resemble their (false) target 
class. This continues to be an evolving area, with profound implications for the robustness 
and safety of machine learning-based applications, as is the issue of dataset bias (Torralba and 
Efros 2011), which can be guarded against, to some extent, by testing cross-dataset transfer
performance (Ranftl, Lasinger et al. 2020).


5.4.7 Self-supervised learning 


As we mentioned previously, it is quite common to pre-train a backbone (or trunk) network
for one task, e.g., whole image classification, and to then replace the final (few) layers with a
new head (or one or more branches), which are then trained for a different task, e.g., semantic
image segmentation (He, Gkioxari et al. 2017). Optionally, the last few layers of the original 
backbone network can be fine-tuned. 

The idea of training on one task and then using the learning on another is called transfer 
learning, while the process of modifying the final network to its intended application and 
statistics is called domain adaptation. While this idea was originally applied to backbones 
trained on labeled datasets such as ImageNet, i.e., in a fully supervised manner, the prospect
of pre-training on the immensely larger set of unlabeled real-world images always remained
tantalizing.

The central idea in self-supervised learning is to create a supervised pretext task where 
the labels can be automatically derived from unlabeled images, e.g., by asking the network to 
predict a subset of the information from the rest. Once pre-trained, the network can then be 
modified and fine-tuned on the final intended downstream task. Weng (2019) has a wonderful 
introductory blog post on this topic, and Zisserman (2018) has a great lecture, where the term 
proxy task is used. Additional good introductions can be found in the survey by Jing and Tian 
(2020) and the bibliography by Ren (2020). 

Figure 5.51 shows some examples of pretext tasks that have been proposed for pre-training
image classification networks. These include:

• Context prediction (Doersch, Gupta, and Efros 2015): take nearby image patches and
  predict their relative positions.

• Context encoders (Pathak, Krahenbuhl et al. 2016): inpaint one or more missing regions
  in an image.

• 9-tile jigsaw puzzle (Noroozi and Favaro 2016): rearrange the tiles into their correct
  positions.

• Colorizing black and white images (Zhang, Isola, and Efros 2016).

• Rotating images by multiples of 90° to make them upright (Gidaris, Singh, and Komodakis
  2018). The paper compares itself against 11 previous self-supervised techniques.
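
To make the notion of automatically derived labels concrete, here is a minimal sketch of how training pairs for the rotation pretext task might be generated. The function names and the tiny list-of-lists image are illustrative assumptions; the label is simply the number of 90° turns applied, which a network would then be trained to predict.

```python
def rotate90(img):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotation_pretext_examples(img):
    """Return [(rotated_image, k)] where k in 0..3 counts the 90-degree turns."""
    examples = []
    current = [row[:] for row in img]
    for k in range(4):
        examples.append((current, k))
        current = rotate90(current)
    return examples


img = [[1, 2],
       [3, 4]]
examples = rotation_pretext_examples(img)
```

No human annotation is involved: each unlabeled image yields four labeled examples for free.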


In addition to using single-image pretext tasks, many researchers have used video clips, 
since successive frames contain semantically related content. One way to use this infor- 
mation is to order video frames correctly in time, i.e., to use a temporal version of context 
prediction and jigsaw puzzles (Misra, Zitnick, and Hebert 2016; Wei, Lim et al. 2018). Another
is to extend colorization to video, with the colors in the first frame given (Vondrick,
Shrivastava et al. 2018), which encourages the network to learn semantic categories and correspondences.
And since videos usually come with sounds, these can be used as additional
cues in self-supervision, e.g., by asking a network to align visual and audio signals (Chung
and Zisserman 2016; Arandjelovic and Zisserman 2018; Owens and Efros 2018), or in an
unsupervised (contrastive) framework (Alwassel, Mahajan et al. 2020; Patrick, Asano et al.
2020).

Figure 5.51 Examples of self-supervised learning tasks: (a) guessing the relative positions
of image patches—can you guess the answers to Q1 and Q2? (Doersch, Gupta, and Efros
2015) © 2015 IEEE; (b) solving a nine-tile jigsaw puzzle (Noroozi and Favaro 2016) © 2016
Springer; (c) image colorization (Zhang, Isola, and Efros 2016) © 2016 Springer; (d) video
color transfer for tracking (Vondrick, Shrivastava et al. 2018) © 2016 Springer.

Since self-supervised learning shows such great promise, an open question is whether 
such techniques could at some point surpass the performance of fully-supervised networks 
trained on smaller fully-labeled datasets.⁵⁰ Some impressive results have been shown using
semi-supervised (weak) learning (Section 5.2.5) on very large (300M–3.5B) partially or noisily
labeled datasets such as JFT-300M (Sun, Shrivastava et al. 2017) and Instagram hashtags
(Mahajan, Girshick et al. 2018). Other researchers have tried simultaneously using supervised
learning on labeled data and self-supervised pretext task learning on unlabeled data
(Zhai, Oliver et al. 2019; Sun, Tzeng et al. 2019). It turns out that getting the most out of
such approaches requires careful attention to dataset size, model architecture and capacity,
and the exact details (and difficulty) of the pretext tasks (Kolesnikov, Zhai, and Beyer 2019;
Goyal, Mahajan et al. 2019; Misra and Maaten 2020). At the same time, others are investigating
how much real benefit pre-training actually gives in downstream tasks (He, Girshick,
and Dollár 2019; Newell and Deng 2020; Feichtenhofer, Fan et al. 2021).

50 https://people.eecs.berkeley.edu/~efros/gelato_bet.html


Self-supervised training systems automatically generate ground truth labels for pretext
tasks so that these can be used in a supervised manner (e.g., by minimizing classification
errors). An alternative is to use unsupervised learning with a contrastive loss (Section 5.3.4) or 
other ranking loss (Gómez 2019) to encourage semantically similar inputs to produce similar 
encodings while spreading dissimilar inputs further apart. This is now commonly called
contrastive (metric) learning.


Wu, Xiong et al. (2018) train a network to produce a separate embedding for each instance 
(training example), which they store in a moving average memory bank as new samples are 
fed through the neural network being trained. They then classify new images using near- 
est neighbors in the embedding space. Momentum Contrast (MoCo) replaces the memory 
bank with a fixed-length queue of encoded samples fed through a temporally adapted mo- 
mentum encoder, which is separate from the actual network being trained (He, Fan et al. 
2020). Pretext-invariant representation learning (PIRL) uses pretext tasks and “multi-crop” 
data augmentation, but then compares their outputs using a memory bank and contrastive loss 
(Misra and Maaten 2020). SimCLR (simple framework for contrastive learning) uses fixed 
mini-batches and applies a contrastive loss (normalized temperature cross-entropy, similar to 
(5.58)) between each sample in the batch and all the other samples, along with aggressive 
data augmentation (Chen, Kornblith et al. 2020). MoCo v2 combines ideas from MoCo and 
SimCLR to obtain even better results (Chen, Fan et al. 2020). Rather than directly comparing 
the generated embeddings, a fully connected network (MLP) is first applied. 
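
The normalized temperature cross-entropy loss mentioned above can be sketched as follows. This is a minimal single-anchor version written under stated assumptions: embeddings are plain Python lists, similarity is cosine similarity, and the temperature value and the toy batch are arbitrary choices for illustration rather than settings taken from any of the cited papers.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def nt_xent(embeddings, i, j, temperature=0.5):
    """Loss for anchor i whose positive is j; every other sample acts as a negative."""
    sims = [cosine(embeddings[i], z) / temperature for z in embeddings]
    # The anchor itself is excluded from the normalization.
    denom = sum(math.exp(s) for k, s in enumerate(sims) if k != i)
    return -(sims[j] - math.log(denom))


# Toy batch: sample 1 is nearly aligned with sample 0, sample 3 points the opposite way.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss_close = nt_xent(batch, 0, 1)
loss_far = nt_xent(batch, 0, 3)
```

The loss is small when the designated positive is already the nearest neighbor of the anchor and grows when the positive lies far away in the embedding space.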


Contrastive losses are a useful tool in metric learning, since they encourage distances 
in an embedding space to be small for semantically related inputs. An alternative is to use 
deep clustering to similarly encourage related inputs to produce similar outputs (Caron, Bo- 
janowski et al. 2018; Ji, Henriques, and Vedaldi 2019; Asano, Rupprecht, and Vedaldi 2020; 
Gidaris, Bursuc et al. 2020; Yan, Misra et al. 2020). Some of the latest results using clustering
for unsupervised learning now produce results competitive with contrastive metric learning
and also suggest that the kinds of data augmentation being used are even more important than
the actual losses that are chosen (Caron, Misra et al. 2020; Tian, Chen, and Ganguli 2021). 
In the context of vision and language (Section 6.6), CLIP (Radford, Kim et al. 2021) has 
achieved remarkable generalization for image classification using contrastive learning and 
“natural-language supervision.” With a dataset of 400 million text and image pairs, their task 
is to take in a single image and a random sample of 32,768 text snippets and predict which
text snippet is truly paired with the image.


Interestingly, it has recently been discovered that representation learning that only en- 
forces similarity between semantically similar inputs also works well. This seems counter- 
intuitive, because without negative pairs as in contrastive learning, the representation can 
easily collapse to trivial solutions by predicting a constant for any input and maximizing sim- 
ilarity. To avoid this collapse, careful attention is often paid to the network design. Bootstrap 
Your Own Latent (BYOL) (Grill, Strub et al. 2020) shows that with a momentum encoder, an 
extra predictor MLP on the online network side, and a stop-gradient operation on the target 
network side, one can successfully remove the negatives from MoCo v2 training. SimSiam 
(Chen and He 2021) further shows that even the momentum encoder is not required and only
a stop-gradient operation is sufficient for the network to learn meaningful representations.
While both systems jointly train the predictor MLP and the encoder with gradient updates,
it has been even more recently shown that the predictor weights can be directly set using
statistics of the input right before the predictor layer (Tian, Chen, and Ganguli 2021). Feichtenhofer,
Fan et al. (2021) compare a number of these unsupervised representation learning
techniques on a variety of video understanding tasks and find that the learned spatiotemporal
representations generalize well to different tasks.
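
The momentum encoder used by MoCo and BYOL is usually described as an exponential moving average of the online network's weights. The sketch below shows only that update rule, on a flat list of numbers standing in for network parameters; the function name and the momentum value are illustrative assumptions (in practice the momentum is chosen close to 1).

```python
def momentum_update(target, online, m=0.99):
    """Exponential moving average: target <- m * target + (1 - m) * online."""
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]


target = [0.0, 0.0]          # slowly changing (momentum) encoder weights
online = [1.0, -2.0]         # weights being trained by gradient descent
for _ in range(3):
    target = momentum_update(target, online, m=0.5)
```

No gradients flow into `target`; it only tracks the online weights, which is what the stop-gradient operation enforces in these systems.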


Contrastive learning and related work rely on compositions of data augmentations (e.g.,
color jitters, random crops, etc.) to learn representations that are invariant to such changes
(Chen and He 2021). An alternative approach is to use generative modeling (Chen, Radford
et al. 2020), where the representations are pre-trained by predicting pixels either in an auto-regressive
(GPT- or other language model) manner or a de-noising (BERT-, masked auto-encoder)
manner. Generative modeling has the potential to bridge self-supervised learning
across domains from vision to NLP, where scalable pre-trained models are now dominant.


One final variant on self-supervised learning is using a student-teacher model, where the 
teacher network is used to provide training examples to a student network. These kinds 
of architectures were originally called model compression (Bucila, Caruana, and Niculescu- 
Mizil 2006) and knowledge distillation (Hinton, Vinyals, and Dean 2015) and were used to 
produce smaller models. However, when coupled with additional data and larger capacity
networks, they can also be used to improve performance. Xie, Luong et al. (2020) train
an EfficientNet (Tan and Le 2019) on the labeled ImageNet training set, and then use this
network to label an additional 300M unlabeled images. The true labels and pseudo-labeled 
images are then used to train a higher-capacity “student”, using regularization (e.g., dropout) 
and data augmentation to improve generalization. The process is then repeated to yield further 
improvements. 

In all, self-supervised learning is currently one of the most exciting sub-areas in deep 
learning,⁵¹ and many leading researchers believe that it may hold the key to even better deep
learning (LeCun and Bengio 2020). To explore implementations further, VISSL provides
open-source PyTorch implementations of many state-of-the-art self-supervised learning models
(with weights) that were described in this section.⁵²


5.5 More complex models 


While deep neural networks started off being used in 2D image understanding and processing 
applications, they are now also widely used for 3D data such as medical images and video 
sequences. We can also chain a series of deep networks together in time by feeding the results 
from one time frame to the next (or even forward-backward). In this section, we look first 
at three-dimensional convolutions and then at recurrent models that propagate information 
forward or bi-directionally in time. We also look at generative models that can synthesize
completely new images from semantic or related inputs.


5.5.1 Three-dimensional CNNs 


As we just mentioned, deep neural networks in computer vision started off being used in the 
processing of regular two-dimensional images. However, as the amount of video being shared 
and analyzed increases, deep networks are also being applied to video understanding, which 
we discuss in more detail in Section 6.5. We are also seeing applications in three-dimensional
volumetric models such as occupancy maps created from range data (Section 13.5) and volu- 
metric medical images (Section 6.4.1). 

It may appear, at first glance, that the convolutional networks we have already studied,
such as the ones illustrated in Figures 5.33, 5.34, and 5.39, already perform 3D convolutions,
since their input receptive fields are 3D boxes in (x, y, c), where c is the (feature) channel
dimension. So, we could in principle fit a sliding window (say in time, or elevation) into a
2D network and be done. Or, we could use something like grouped convolutions. However,
it’s more convenient to operate on a complete 3D volume all at once, and to have weight
sharing across the third dimension for all kernels, as well as multiple input and output feature
channels at each layer.

51 https://sites.google.com/view/self-supervised-icml2019
52 https://github.com/facebookresearch/vissl

Figure 5.52 Alternative approaches to information fusion over the temporal dimensions
(Karpathy, Toderici et al. 2014) © 2014 IEEE.

One of the earliest applications of 3D convolutions was in the processing of video data to 
classify human actions (Kim, Lee, and Yang 2007; Ji, Xu et al. 2013; Baccouche, Mamalet 
et al. 2011). Karpathy, Toderici et al. (2014) describe a number of alternative architectures 
for fusing temporal information, as illustrated in Figure 5.52. The single frame approach 
classifies each frame independently, depending purely on that frame’s content. Late fusion 
takes features generated from each frame and makes a per-clip classification. Early fusion 
groups small sets of adjacent frames into multiple channels in a 2D CNN. As mentioned 
before, the interactions across time do not have the convolutional aspects of weight sharing 
and temporal shift invariance. Finally, 3D CNNs (Ji, Xu et al. 2013) (not shown in this 
figure) learn 3D space- and time-invariant kernels that are run over spatio-temporal windows
and fused into a final score. 
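
To illustrate what weight sharing across the temporal dimension means, the following sketch implements a naive single-channel 3D "valid" convolution (strictly, a cross-correlation, as in most deep learning libraries) over a tiny video indexed as (t, y, x). The kernel and video are made-up examples; a real 3D CNN would add multiple channels, padding, striding, and learned weights.

```python
def conv3d_valid(volume, kernel):
    """Single-channel 'valid' 3D cross-correlation over (t, y, x)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The same kernel weights are applied at every (t, y, x) offset.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# A 2x1x1 temporal-difference kernel: responds to change between consecutive frames.
kernel = [[[-1.0]], [[1.0]]]
# Four 2x2 frames in which one pixel switches on at t=1 and off at t=3.
video = [[[0.0, 0.0], [0.0, 0.0]],
         [[0.0, 1.0], [0.0, 0.0]],
         [[0.0, 1.0], [0.0, 0.0]],
         [[0.0, 0.0], [0.0, 0.0]]]
response = conv3d_valid(video, kernel)
```

Because the kernel slides along time as well as space, the onset and offset of the pixel are detected by the same weights regardless of when they occur, which is the temporal shift invariance that early fusion lacks.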

Tran, Bourdev et al. (2015) show how very simple 3 × 3 × 3 convolutions combined with
pooling in a deep network can be used to obtain even better performance. Their C3D network 
can be thought of as the “VGG of 3D CNNs” (Johnson 2020, Lecture 18). Carreira and 
Zisserman (2017) compare this architecture to alternatives that include two-stream models 
built by analyzing pixels and optical flows in parallel pathways (Figure 6.44b). Section 6.5 
on video understanding discusses these and other architectures used for such problems, which 
have also been attacked using sequential models such as recurrent neural networks (RNNs) 
and LSTM, which we discuss in Section 5.5.2. Lecture 18 on video understanding by Johnson 
(2020) has a nice review of all these video understanding architectures. 

In addition to video processing, 3D convolutional neural networks have been applied to 
volumetric image processing. Two examples of shape modeling and recognition from range 
data, i.e., 3D ShapeNets (Wu, Song et al. 2015) and VoxNet (Maturana and Scherer 2015),
are shown in Figure 5.53. Examples of their application to medical image segmentation
(Kamnitsas, Ferrante et al. 2016; Kamnitsas, Ledig et al. 2017) are discussed in Section 6.4.1.
We discuss neural network approaches to 3D modeling in more detail in Sections 13.5.1 and
14.6.

Figure 5.53 3D convolutional networks applied to volumetric data for object detection:
(a) 3D ShapeNets (Wu, Song et al. 2015) © 2015 IEEE; (b) VoxNet (Maturana and Scherer
2015) © 2015 IEEE.


Like regular 2D CNNs, 3D CNN architectures can exploit different spatial and temporal 
resolutions, striding, and channel depths, but they can be very computation- and memory-intensive.
To counteract this, Feichtenhofer, Fan et al. (2019) develop a two-stream SlowFast
architecture, where a slow pathway operates at a lower frame rate and is combined with fea- 
tures from a fast pathway with higher temporal sampling but fewer channels (Figure 6.44c). 
Video processing networks can also be made more efficient using channel-separated convolu- 
tions (Tran, Wang et al. 2019) and neural architecture search (Feichtenhofer 2020). Multigrid 
techniques (Appendix A.5.3) can also be used to accelerate the training of video recognition 
models (Wu, Girshick et al. 2020).

Figure 5.54 Overview of the Mesh R-CNN system (Gkioxari, Malik, and Johnson 2019) ©
2019 IEEE. A Mask R-CNN backbone is augmented with two 3D shape inference branches.
The voxel branch predicts a coarse shape for each detected object, which is further deformed
with a sequence of refinement stages in the mesh refinement branch.


3D point clouds and meshes 


In addition to processing 3D gridded data such as volumetric density, implicit distance func- 
tions, and video sequences, neural networks can be used to infer 3D models from single 
images. One approach is to predict per-pixel depth, which we study in Section 12.8. Another 
is to reconstruct full 3D models represented using volumetric density (Choy, Xu et al. 2016), 
which we study in Sections 13.5.1 and 14.6. Some more recent experiments, however, sug- 
gest that some of these 3D inference networks (Tatarchenko, Dosovitskiy, and Brox 2017; 
Groueix, Fisher et al. 2018; Richter and Roth 2018) may just be recognizing the general 
object category and doing a small amount of fitting (Tatarchenko, Richter et al. 2019). 

Generating and processing 3D point clouds has also been extensively studied (Fan, Su, 
and Guibas 2017; Qi, Su et al. 2017; Wang, Sun et al. 2019). Guo, Wang et al. (2020) provide 
a comprehensive survey that reviews over 200 publications in this area. 

A final alternative is to infer 3D triangulated meshes from either RGB-D (Wang, Zhang 
et al. 2018) or regular RGB (Gkioxari, Malik, and Johnson 2019; Wickramasinghe, Fua, and 
Knott 2021) images. Figure 5.54 illustrates the components of the Mesh R-CNN system, 
which detects images of 3D objects and turns each one into a triangulated mesh after first 
reconstructing a volumetric model. The primitive operations and representations needed to
process such meshes using deep neural networks can be found in the PyTorch3D library.⁵³


53 https://github.com/facebookresearch/pytorch3d




Figure 5.55 A deep recurrent neural network (RNN) uses multiple stages to process sequential
data, with the output of one stage feeding the input of the next © Glassner (2018).
Each stage maintains its own state and backpropagates its own gradients, although weights
are shared between all stages. Column (a) shows a more compact rolled-up diagram, while
column (b) shows the corresponding unrolled version.


5.5.2 Recurrent neural networks 


While 2D and 3D convolutional networks are a good fit for images and volumes, sometimes 
we wish to process a sequence of images, audio signals, or text. A good way to exploit pre- 
viously seen information is to pass features detected at one time instant (e.g., video frame) 
as input to the next frame’s processing. Such architectures are called Recurrent Neural Net- 
works (RNNs) and are described in more detail in Goodfellow, Bengio, and Courville (2016, 
Chapter 10) and Zhang, Lipton et al. (2021, Chapter 8). Figure 5.55 shows a schematic 
sketch of such an architecture. Deep network layers not only pass information on to subse- 
quent layers (and an eventual output), but also feed some of their information as input to the 
layer processing the next frame of data. Individual layers share weights across time (a bit like 
3D convolution kernels), and backpropagation requires computing derivatives for all of the 
“unrolled” units (time instances) and summing these derivatives to obtain weight updates. 
Because gradients can propagate for a long distance backward in time, and can therefore 
vanish or explode (just as in deep networks before the advent of residual networks), it is also 
possible to add extra gating units to modulate how information flows between frames. Such 
architectures are called Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM)
(Hochreiter and Schmidhuber 1997; Zhang, Lipton et al. 2021, Chapter 9).

RNNs and LSTMs are often used for video processing, since they can fuse information 
over time and model temporal dependencies (Baccouche, Mamalet et al. 2011; Donahue, 
Hendricks et al. 2015; Ng, Hausknecht et al. 2015; Srivastava, Mansimov, and Salakhutdinov
2015; Ballas, Yao et al. 2016), as well as language modeling, image captioning, and visual
question answering. We discuss these topics in more detail in Sections 6.5 and 6.6. They have 
also occasionally been used to merge multi-view information in stereo (Yao, Luo et al. 2019; 
Riegler and Koltun 2020a) and to simulate iterative flow algorithms in a fully differentiable 
(and hence trainable) manner (Hur and Roth 2019; Teed and Deng 2020b). 
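
The basic recurrence described at the start of this section, with weights shared across time steps, can be sketched with a scalar toy model. The code below assumes a simple tanh recurrence with made-up weights; it also computes the derivative of the last state with respect to the initial state, which is a product of per-step factors and therefore illustrates why gradients tend to vanish (or explode) over long sequences.

```python
import math


def rnn_forward(xs, w_x, w_h, b, h0=0.0):
    """Scalar recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = h0
    states = []
    for x in xs:
        # The same (w_x, w_h, b) are reused at every time step.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


def state_jacobian(states, w_h):
    """d h_T / d h_0 = product over t of w_h * (1 - h_t^2) for the tanh recurrence."""
    jac = 1.0
    for h in states:
        jac *= w_h * (1.0 - h * h)
    return jac


states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
jac = state_jacobian(states, w_h=0.5)
```

With `w_h = 0.5`, every factor in the product is below 0.5, so the influence of the initial state shrinks geometrically with the number of steps; gating units in GRUs and LSTMs are designed to counteract this.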

To propagate information forward in time, RNNs, GRUs, and LSTMs need to encode all 
of the potentially useful previous information in the hidden state being passed between time 
steps. In some situations, it is useful for a sequence modeling network to look further back 
(or even forward) in time. This kind of capability is often called attention and is described 
in more detail in Zhang, Lipton et al. (2021, Chapter 10), Johnson (2020, Lecture 12), and 
Section 5.5.3 on transformers. In brief, networks with attention store lists of keys and values, 
which can be probed with a query to return a weighted blend of values depending on the 
alignment between the query and each key. In this sense, they are similar to kernel regression
(4.12–4.14), which we studied in Section 4.1.1, except that the query and the keys are
multiplied (with appropriate weights) before being passed through a softmax to determine the 
blending weights. 
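
The key, value, and query mechanism just described can be sketched in a few lines. This version is an illustration only: it omits the learned projection weights, uses plain Python lists, and divides the dot products by the square root of the key dimension, a scaling borrowed from scaled dot-product attention that is an assumption here rather than something stated above.

```python
import math


def attend(query, keys, values):
    """Return the softmax-weighted blend of values and the blending weights."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    blended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return blended, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
out, weights = attend([4.0, 0.0], keys, values)   # query aligned with the first key
```

The query lines up with the first key, so the returned value is pulled strongly, but not entirely, toward the first stored value.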

Attention can either be used to look backward at the hidden states in previous time in- 
stances (which is called self-attention), or to look at different parts of the image (visual at- 
tention, as illustrated in Figure 6.46). We discuss these topics in more detail in Section 6.6 
on vision and language. When recognizing or generating sequences, such as the words in 
a sentence, attention modules were often used in tandem with sequential models such as
RNNs or LSTMs. However, more recent works have made it possible to apply attention to 
the entire input sequence in one parallel step, as described in Section 5.5.3 on transformers. 

The brief descriptions in this section just barely skim the broad topic of deep sequence 
modeling, which is usually covered in several lectures in courses on deep learning (e.g., John- 
son 2020, Lectures 12-13) and several chapters in deep learning textbooks (Zhang, Lipton et 
al. 2021, Chapters 8-10). Interested readers should consult these sources for more detailed
information.


5.5.3 Transformers 


Transformers, which are a novel architecture that adds attention mechanisms (which we describe
below) to deep neural networks, were first introduced by Vaswani, Shazeer et al. (2017)
in the context of neural machine translation, where the task consists of translating text from
one language to another (Mikolov, Sutskever et al. 2013). In contrast to RNNs and their variants
(Section 5.5.2), which process input tokens one at a time, transformers can operate on
the entire input sequence at once. In the years after first being introduced, transformers became
the dominant paradigm for many tasks in natural language processing (NLP), enabling
the impressive results produced by BERT (Devlin, Chang et al. 2018), RoBERTa (Liu, Ott 
et al. 2019), and GPT-3 (Brown, Mann et al. 2020), among many others. Transformers then 
began seeing success when processing the natural language component and later layers of 
many vision and language tasks (Section 6.6). More recently, they have gained traction in 
pure computer vision tasks, even outperforming CNNs on several popular benchmarks. 

The motivation for applying transformers to computer vision is different from that of 
applying them to NLP. Whereas RNNs suffer from sequentially processing the input, convolutions 
do not have this problem, as their operations are already inherently parallel. Instead, the 
problem with convolutions has to do with their inductive biases, i.e., the default assumptions 
encoded into convolutional models. 

A convolution operation assumes that nearby pixels are more important than far away 
pixels. Only after several convolutional layers are stacked together does the receptive field 
grow large enough to attend to the entire image (Araujo, Norris, and Sim 2019), unless the 
network is endowed with non-local operations (Wang, Girshick et al. 2018) similar to those 
used in some image denoising algorithms (Buades, Coll, and Morel 2005a). As we have seen 
in this chapter, convolution’s spatial locality bias has led to remarkable success across many 
aspects of computer vision. But as datasets, models, and computational power grow by orders 
of magnitude, these inductive biases may become a factor inhibiting further progress.⁵⁴ 

The fundamental component of a transformer is self-attention, which is itself built out of 
applying attention to each of N unit activations in a given layer in the network.⁵⁵ Attention 
is often described using an analogy to the concept of associative maps or dictionaries found 
as data structures in programming languages and databases. Given a set of key-value pairs, 
$\{(\mathbf{k}_i, \mathbf{v}_i)\}$ and a query $\mathbf{q}$, a dictionary returns the value $\mathbf{v}_i$ corresponding to the key $\mathbf{k}_i$ that 
exactly matches the query. In neural networks, the key and query values are real-valued 
vectors (e.g., linear projections of activations), so the corresponding operation returns a weighted 
sum of values where the weights depend on the pairwise distances between a query and the 
set of keys. This is basically the same as scattered data interpolation, which we studied in 


54 Rich Sutton, a pioneer in reinforcement learning, believes that learning to leverage computation, instead of 
encoding human knowledge, is the bitter lesson to learn from the history of AI research (Sutton 2019). Others disagree 
with this view, believing that it is essential to be able to learn from small amounts of data (Lake, Salakhutdinov, and 
Tenenbaum 2015; Marcus 2020). 

55 N may indicate the number of words in a sentence or patches in an image. 

Figure 5.56 The self-attention computation graph to compute a single output vector $\mathbf{x}'_2$, 
courtesy of Matt Deitke, adapted from Vaswani, Huang, and Manning (2019). Note that the 
full self-attention operation also computes outputs for $\mathbf{x}'_1$, $\mathbf{x}'_3$, and $\mathbf{x}'_4$ by shifting the input 
to the query ($\mathbf{x}_2$ in this case) between $\mathbf{x}_1$, $\mathbf{x}_3$, and $\mathbf{x}_4$, respectively. For each of matmul$_V$, 
matmul$_K$, and matmul$_Q$, there is a single matrix of weights that gets reused with each call. 


Section 4.1.1, as pointed out in Zhang, Lipton et al. (2021, Section 10.2). However, instead 
of using radial distances as in (4.14), attention mechanisms in neural networks more 
commonly use scaled dot-product attention (Vaswani, Shazeer et al. 2017; Zhang, Lipton et al. 
2021, Section 10.3.3), which involves taking the dot product between the query and key 
vectors, scaling down by the square root of the dimension of these embeddings $D$,⁵⁶ and then 
applying the softmax function of (5.5), i.e., 


$$\mathbf{y} = \sum_i \alpha(\mathbf{q} \cdot \mathbf{k}_i / \sqrt{D})\, \mathbf{v}_i = \operatorname{softmax}(\mathbf{q}^T \mathbf{K} / \sqrt{D})^T \mathbf{V}, \tag{5.77}$$


where $\mathbf{K}$ and $\mathbf{V}$ are the row-stacked matrices composed of the key and value vectors, 
respectively, and $\mathbf{y}$ is the output of the attention operator.⁵⁷ 
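
The soft dictionary lookup described above can be illustrated with a minimal NumPy sketch. It follows the summation form of (5.77) with row-stacked K and V; the function names and the toy keys, values, and query are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

def attention(q, K, V):
    """Scaled dot-product attention for a single query.

    q: (D,) query; K: (N, D) row-stacked keys; V: (N, Dv) row-stacked values.
    Returns the weighted sum of values and the attention weights.
    """
    D = q.shape[0]
    weights = softmax(K @ q / np.sqrt(D))  # alpha(q . k_i / sqrt(D)), sums to 1
    return weights @ V, weights

# Toy data (hypothetical): three keys, with the query aligned with the second key.
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[10.0], [20.0], [30.0]])
q = np.array([0.0, 8.0])
y, w = attention(q, K, V)
```

Because the query matches the second key much better than the others, almost all of the weight goes to the second value, which is the "soft" analogue of an exact dictionary match.
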

56 We divide the dot product by $\sqrt{D}$ so that the variance of the scaled dot product does not increase for larger 
embedding dimensions, which could result in vanishing gradients. 
57 The partition of unity function $\alpha$ notation is borrowed from Zhang, Lipton et al. (2021, Section 10.3). 

Given a set of input vectors $\{\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{N-1}\}$, the self-attention operation produces a 
set of output vectors $\{\mathbf{x}'_0, \mathbf{x}'_1, \ldots, \mathbf{x}'_{N-1}\}$. Figure 5.56 shows the case for N = 4, where 
the self-attention computation graph is used to obtain a single output vector $\mathbf{x}'_2$. As pictured, 
self-attention uses three learned weight matrices, $\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$, which determine the 

$$\mathbf{q}_i = \mathbf{W}_q \mathbf{x}_i, \quad \mathbf{k}_i = \mathbf{W}_k \mathbf{x}_i, \quad \text{and} \quad \mathbf{v}_i = \mathbf{W}_v \mathbf{x}_i \tag{5.78}$$

per-unit query, key, and value vectors going into each attention block. The weighted sum of 
values is then optionally passed through a multi-layer perceptron (MLP) to produce $\mathbf{x}'_2$. 
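
As a rough sketch of how (5.77) and (5.78) combine, the following NumPy code computes all N self-attention outputs at once. The random weights stand in for learned matrices, the dimensions are arbitrary, and the optional MLP is omitted; these are assumptions made for illustration only.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (N, D) row-stacked inputs; Wq, Wk: (Dk, D); Wv: (Dv, D).

    Returns the (N, Dv) outputs and the (N, N) attention weights.
    """
    Q = X @ Wq.T                                 # rows are q_i = Wq x_i
    K = X @ Wk.T                                 # rows are k_i = Wk x_i
    V = X @ Wv.T                                 # rows are v_i = Wv x_i
    scores = Q @ K.T / np.sqrt(K.shape[1])       # (N, N) entries q_i . k_j / sqrt(Dk)
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
N, D = 4, 8
X = rng.standard_normal((N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))  # stand-ins for learned weights
X_out, A = self_attention(X, Wq, Wk, Wv)

# Reordering the inputs simply reorders the outputs (a set-to-set operation).
perm = [2, 0, 3, 1]
X_out_perm, _ = self_attention(X[perm], Wq, Wk, Wv)
```

The last two lines illustrate that, without any positional information, self-attention treats its inputs as an unordered set.
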

In comparison to a fully connected or convolutional layer, self-attention computes each 
output (e.g., $\mathbf{x}'_i$) based on all of the input vectors $\{\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{N-1}\}$. In that sense, it is often 
compared to a fully connected layer, but instead of the weights being fixed for each input, the 
weights are adapted on the spot, based on the input (Khan, Naseer et al. 2021). Compared to 
convolutions, self-attention is able to attend to every part of the input from the start, instead 
of constraining itself to local regions of the input, which may help it introduce the kind of 
context information needed to disambiguate the objects shown in Figure 6.8. 

There are several components that are combined with self-attention to produce a trans- 
former block, as described in Vaswani, Shazeer et al. (2017). The full transformer consists 
of both an encoder and a decoder block, although both share many of the same components. 
In many applications, an encoder can be used without a decoder (Devlin, Chang et al. 2018; 
Dosovitskiy, Beyer et al. 2021) and vice versa (Razavi, van den Oord, and Vinyals 2019). 

The right side of Figure 5.57 shows an example of a transformer encoder block. For both 
the encoder and decoder: 


• Instead of modeling set-to-set operations, we can model sequence-to-sequence 
operations by adding a positional encoding to each input vector (Gehring, Auli et al. 2017). 
The positional encoding typically consists of a set of temporally shifted sine waves 
from which position information can be decoded. (Such position encodings have also 
recently been added to implicit neural shape representations, which we study in 
Sections 13.5.1 and 14.6.) 

• In lieu of applying a single self-attention operation to the input, multiple self-attention 
operations, with different learned weight matrices to build different keys, values, and 
queries, are often joined together to form multi-headed self-attention (Vaswani, Shazeer 
et al. 2017). The result of each head is then concatenated together before everything is 
passed through an MLP. 

• Layer normalization (Ba, Kiros, and Hinton 2016) is then applied to the output of the 
MLP. Each vector may then independently be passed through another MLP with shared 
weights before layer normalization is applied again. 

• Residual connections (He, Zhang et al. 2016a) are employed after multi-headed 
attention and after the final MLP. 
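
To make the positional encoding in the first item above more concrete, here is a minimal sketch of the sinusoidal variant used by Vaswani, Shazeer et al. (2017), in which even channels hold sines and odd channels hold cosines at geometrically spaced frequencies. The function name, the choice N = 196 and D = 64, and the assumption that D is even are ours.

```python
import numpy as np

def sinusoidal_positional_encoding(N, D, base=10000.0):
    """Return an (N, D) array whose row p encodes position p. Assumes D is even."""
    pos = np.arange(N)[:, None]               # (N, 1) positions 0 .. N-1
    freq = base ** (-np.arange(0, D, 2) / D)  # (D/2,) geometrically spaced frequencies
    pe = np.zeros((N, D))
    pe[:, 0::2] = np.sin(pos * freq)          # even channels
    pe[:, 1::2] = np.cos(pos * freq)          # odd channels
    return pe

pe = sinusoidal_positional_encoding(N=196, D=64)
# In use, row p of pe would be added to the p-th input vector before the first block.
```

Learned positional encodings, as used in ViT, are a common alternative to this fixed scheme.
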


During training, the biggest difference in the decoder is that some of the input vectors to 
self-attention may be masked out, which helps support parallel training in autoregressive 
prediction tasks. Further exposition of the details and implementation of the transformer 
architecture is provided in Vaswani, Shazeer et al. (2017) and in the additional reading (Sec- 
tion 5.6). 
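
One common form of the masking mentioned above is a causal mask, in which position i may only attend to positions j ≤ i. The text does not specify the mask, so the lower-triangular choice below is an assumption, and the function name is hypothetical.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (N, N) raw attention scores. Position i may attend to j <= i only."""
    N = scores.shape[0]
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # masked entries get zero weight
    masked = masked - masked.max(axis=1, keepdims=True)  # the diagonal is never masked
    A = np.exp(masked)
    return A / A.sum(axis=1, keepdims=True)

# With equal scores, each position spreads its weight uniformly over itself and the past.
A = causal_attention_weights(np.zeros((4, 4)))
```

Because every row is masked independently, the predictions for all positions in a training sequence can be computed in one pass, which is what makes parallel training possible.
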

A key challenge of applying transformers to the image domain has to do with the size of 
image input (Vaswani, Shazeer et al. 2017). Let N denote the length of the input, D denote 
the number of dimensions for each input entry, and K denote a convolution’s kernel size 
(on one side). The number of floating point operations (FLOPs) required for self-attention is on 
the order of $O(N^2 D)$, whereas the FLOPs for a convolution operation are on the order of 
$O(N D^2 K^2)$.⁵⁸ For instance, with an ImageNet image scaled to size 224 × 224 × 3, if each 
pixel is treated independently, N = 224 × 224 = 50176 and D = 3. Here, a convolution is 
significantly more efficient than self-attention. In contrast, applications like neural machine 
translation may only have N as the number of words in a sentence and D as the dimension 
for each word embedding (Mikolov, Sutskever et al. 2013), which makes self-attention much 
more efficient. 
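
The two cost estimates can be compared with a few lines of arithmetic. The image sizes come from the text; the kernel size K = 3 and the sentence length and embedding dimension are illustrative assumptions, and constant factors are ignored.

```python
def self_attention_cost(N, D):
    """Order-of-magnitude operation count N^2 D."""
    return N * N * D

def convolution_cost(N, D, K):
    """Order-of-magnitude operation count N D^2 K^2."""
    return N * D * D * K * K

# ImageNet-sized image with one input vector per pixel (from the text); K = 3 is assumed.
N_img, D_img, K = 224 * 224, 3, 3
# Hypothetical sentence: 50 words with 512-dimensional embeddings.
N_txt, D_txt = 50, 512

image_ratio = self_attention_cost(N_img, D_img) / convolution_cost(N_img, D_img, K)
text_ratio = self_attention_cost(N_txt, D_txt) / convolution_cost(N_txt, D_txt, K)
```

The ratio simplifies to N / (D K²), so self-attention is costlier by more than three orders of magnitude in the per-pixel image case and cheaper in the short-sentence case.
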

The Image Transformer (Parmar, Vaswani et al. 2018) was the first attempt at applying 
the full transformer model to the image domain, with many of the same authors that intro- 
duced the transformer. It used both an encoder and decoder to try and build an autoregressive 
generative model that predicts the next pixel, given a sequence of input pixels and all the 
previously predicted pixels. (The earlier work on non-local networks by Wang, Girshick et 
al. (2018) also used ideas inspired by transformers, but with a simpler attention block and a 
fully two-dimensional setup.) Each vector input to the transformer corresponded to a single 
pixel, which ultimately constrained them to generate small images (i.e., 32 x 32), since the 
quadratic cost of self-attention was too expensive otherwise. 

Dosovitskiy, Beyer et al. (2021) had a breakthrough that allowed transformers to process 
much larger images. Figure 5.57 shows the diagram of the model, named the Vision Trans- 
former (ViT). For the task of image recognition, instead of treating each pixel as a separate 
input vector to the transformer, they divide an image (of size 224 x 224) into 196 distinct 
16 x 16 gridded image patches. Each patch is then flattened, and passed through a shared 
embedding matrix, which is equivalent to a strided 16 x 16 convolution, and the results are 
combined with a positional encoding vector and then passed to the transformer. Earlier work 


58 In Section 5.4 on convolutional architectures, we use C to denote the number of channels instead of D to denote 
the embedding dimensions. 

Figure 5.57 The Vision Transformer (ViT) model from (Dosovitskiy, Beyer et al. 2021) 
breaks an image into a grid of 16 × 16 patches. Each patch is then flattened, passed through 
a shared embedding matrix, and combined with a positional encoding vector. These inputs 
are then passed through a transformer encoder (right) several times before predicting an 
image’s class. 


from Cordonnier, Loukas, and Jaggi (2019) introduced a similar patching approach, but on a 
smaller scale with 2 x 2 patches. 
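
The patch extraction and shared embedding used by ViT can be sketched as follows. The random image, the embedding dimension of 64, and the random matrices standing in for learned parameters are all assumptions for illustration; the class token used by ViT is omitted.

```python
import numpy as np

def patchify(image, P=16):
    """Split an (H, W, C) image into (H/P * W/P) flattened P x P patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    x = image.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)             # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image)                      # 14 x 14 = 196 patches of 16 * 16 * 3 = 768 values
E = rng.standard_normal((768, 64)) * 0.02      # stand-in for the shared embedding matrix
pos = rng.standard_normal((196, 64)) * 0.02    # stand-in for a learned positional encoding
tokens = patches @ E + pos                     # inputs to the transformer encoder
```

The input length drops from N = 50176 pixels to N = 196 patches, which is what makes the quadratic cost of self-attention manageable.
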

ViT was only able to outperform their convolutional baseline BiT (Kolesnikov, Beyer et 
al. 2020) when using over 100 million training images from JFT-300M (Sun, Shrivastava 
et al. 2017). When using ImageNet alone, or a random subset of 10 or 30 million training 
samples from JFT-300M, the ViT model typically performed much worse than the BiT baseline. 
Their results suggest that in low-data domains, the inductive biases present in convolutions 
are typically quite useful. But, with orders of magnitude more data, a transformer model 
might discover even better representations that are not representable with a CNN. 

Some works have also gone into combining the inductive biases of convolutions with 
transformers (Srinivas, Lin et al. 2021; Wu, Xiao et al. 2021; Lu, Batra et al. 2019; Yuan, Guo 
et al. 2021). An influential example of such a network is DETR (Carion, Massa et al. 2020), 
which is applied to the task of object detection. It first processes the image with a ResNet 
backbone, with the output getting passed to a transformer encoder-decoder architecture. They 
find that the addition of a transformer improves the ability to detect large objects, which 
is believed to result from its ability to reason globally about correspondences between 
input encoding vectors. 

The application and usefulness of transformers in the realm of computer vision are still 
being widely researched. Already, however, they have achieved impressive performance on 
a wide range of tasks, with new papers being published rapidly.⁵⁹ Some more notable 
applications include image classification (Liu, Lin et al. 2021; Touvron, Cord et al. 2020), object 
detection (Dai, Cai et al. 2020; Liu, Lin et al. 2021), image pre-training (Chen, Radford et 
al. 2020), semantic segmentation (Zheng, Lu et al. 2020), pose recognition (Li, Wang et al. 
2021), super-resolution (Zeng, Fu, and Chao 2020), colorization (Kumar, Weissenborn, and 
Kalchbrenner 2021), generative modeling (Jiang, Chang, and Wang 2021; Hudson and 
Zitnick 2021), and video classification (Arnab, Dehghani et al. 2021; Fan, Xiong et al. 2021; 
Li, Zhang et al. 2021). Recent works have also found success extending ViT’s patch 
embedding to pure MLP vision architectures (Tolstikhin, Houlsby et al. 2021; Liu, Dai et al. 
2021; Touvron, Bojanowski et al. 2021). Applications to vision and language are discussed 
in Section 6.6. 


5.5.4 Generative models 


Throughout this chapter, I have mentioned that machine learning algorithms such as logis- 
tic regression, support vector machines, random trees, and feedforward deep neural networks 
are all examples of discriminative systems that never form an explicit generative model of the 
quantities they are trying to estimate (Bishop 2006, Section 1.5; Murphy 2012, Section 8.6). 
In addition to the potential benefits of generative models discussed in these two textbooks, 
Goodfellow (2016) and Kingma and Welling (2019) list some additional ones, such as the 
ability to visualize our assumptions about our unknowns, training with missing or incom- 
pletely labeled data, and the ability to generate multiple, alternative, results. 

In computer graphics, which is sometimes called image synthesis (as opposed to the im- 
age understanding or image analysis we do in computer vision), the ability to easily gener- 
ate realistic random images and models has long been an essential tool. Examples of such 
algorithms include texture synthesis and style transfer, which we study in more detail in Sec- 
tion 10.5, as well as fractal terrain (Fournier, Fussel, and Carpenter 1982) and tree generation 
(Prusinkiewicz and Lindenmayer 1996). Examples of deep neural networks being used to 
generate such novel images, often under user control, are shown in Figures 5.60 and 10.58. 
Related techniques are also used in the nascent field of neural rendering, which we discuss 
in Section 14.6. 

How can we unlock the demonstrated power of deep neural networks to capture 
semantics in order to visualize sample images and generate new ones? One approach could be to 
use the visualization techniques introduced in Section 5.4.5. But as you can see from 
Figure 5.49, while such techniques can give us insights into individual units, they fail to create 
fully realistic images. 

59 https://github.com/dk-liang/Awesome-Visual-Transformer 

Another approach might be to construct a decoder network to undo the classification 
performed by the original (encoder) network. This kind of “bottleneck” architecture is widely 
used, as shown in Figure 5.37a, to derive semantic per-pixel labels from images. Can we use 
a similar idea to generate realistic-looking images? 


Variational autoencoders 


A network that encodes an image into small compact codes and then attempts to decode it 
back into the same image is called an autoencoder. The compact codes are typically 
represented as a vector, which is often called the latent vector to emphasize that it is hidden 
and unknown. Autoencoders have a long history of use in neural networks, even predating 
today’s feedforward networks (Kingma and Welling 2019). It was once believed that this 
might be a good way to pre-train networks, but the more challenging proxy tasks we studied 
in Section 5.4.7 have proven to be more effective. 

At a high level, to train an autoencoder on a dataset of images, we can use an unsupervised 
objective that tries to have the output image of the decoder match the training image input to 
the encoder. To generate a new image, we can then randomly sample a latent vector and hope 
that from that vector, the decoder can generate a new image that looks like it came from the 
distribution of training images in our dataset. 

With an autoencoder, there is a deterministic, one-to-one mapping from each input to 
its latent vector. Hence, the number of latent vectors that are generated exactly matches the 
number of input data points. If the encoder’s objective is to produce a latent vector that makes 
it easy to decode, one possible solution would be for every latent vector to be extremely far 
away from every other latent vector. Here, the decoder can overfit all the latent vectors it has 
seen since they would all be unique with little overlap. However, as our goal is to randomly 
generate latent vectors that can be passed to the decoder to generate realistic images, we want 
the latent space to both be well explored and to encode some meaning, such as nearby vectors 
being semantically similar. Ghosh, Sajjadi et al. (2019) propose one potential solution, where 
they inject noise into the latent vector and empirically find that it works quite well. 

Figure 5.58 The VQ-VAE model. On the left, $z_e(x)$ represents the output of the encoder, 
the embedding space on top represents the codebook of K embedding vectors, and $q(z \mid x)$ 
represents the process of replacing each spatial (i.e., channel-wise) vector in the output of 
the encoder with its nearest vector in the codebook. On the right, we see how a $z_e(x)$ vector 
(green) may be rounded to $e_2$, and that the gradient in the encoder network (red) may push 
the vector away from $e_2$ during backpropagation. © van den Oord, Vinyals, and Kavukcuoglu 
(2017) 

Another extension of the autoencoder is the variational autoencoder (VAE) (Kingma and 
Welling 2013; Rezende, Mohamed, and Wierstra 2014; Kingma and Welling 2019). Instead 
of generating a single latent vector for each input, it generates the mean and covariance that 
define a chosen distribution of latent vectors. The distribution can then be sampled from 
to produce a single latent vector, which gets passed into the decoder. To avoid having the 
covariance matrix become the zero matrix, making the sampling process deterministic, the 
objective function often includes a regularization term to penalize the distribution if it is far 
from some chosen (e.g., Gaussian) distribution. Due to their probabilistic nature, VAEs can 
explore the space of possible latent vectors significantly better than autoencoders, making it 
harder for the decoder to overfit the training data. 
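
The sampling step and the regularization term of a VAE can be sketched under two common assumptions that the text does not spell out: the covariance is diagonal (parameterized by its log-variance), and the chosen reference distribution is a standard normal, for which the KL divergence has a closed form. Function names and numeric values are illustrative.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])       # hypothetical encoder outputs for one image
log_var = np.array([0.0, -2.0])
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)

# Letting the variance shrink towards zero (deterministic sampling) is penalized.
kl_collapsed = kl_to_standard_normal(mu, np.full(2, -20.0))
```

The penalty is zero only when the predicted distribution equals the reference, and it grows as the variance collapses, which discourages the encoder from reverting to a deterministic autoencoder.
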


Motivated by how natural language is discrete and by how images can typically be 
described in language (Section 6.6), the vector quantized VAE (VQ-VAE) of van den Oord, 
Vinyals, and Kavukcuoglu (2017) takes the approach of modeling the latent space with 
categorical variables. Figure 5.58 shows an outline of the VQ-VAE architecture. The encoder and 
decoder operate like a normal VAE, where the encoder predicts some latent representation 
from the input, and the decoder generates an image from the latent representation. However, 
in contrast to the normal VAE, the VQ-VAE replaces the vector at each spatial location of the 
predicted latent representation with its nearest vector from a discrete set of vectors (named the 
codebook). The discretized latent representation is then passed to the decoder. The vectors in the 
codebook are trained simultaneously with the VAE’s encoder and decoder. Here, the 
codebook vectors are optimized to move closer to the spatial vectors output by the encoder. 
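
The nearest-codebook replacement step can be sketched as a brute-force nearest-neighbor search. The latent size of 4 × 4 × 8 and the random codebook are arbitrary stand-ins; training of the codebook and the gradient handling described in Figure 5.58 are not shown.

```python
import numpy as np

def quantize(z_e, codebook):
    """z_e: (H, W, C) encoder output; codebook: (K, C) embedding vectors.

    Returns the (H, W) code indices and the (H, W, C) quantized latents.
    """
    flat = z_e.reshape(-1, z_e.shape[-1])                          # (H*W, C)
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (H*W, K) squared distances
    idx = d2.argmin(axis=1)                                        # nearest code per location
    return idx.reshape(z_e.shape[:-1]), codebook[idx].reshape(z_e.shape)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 8))   # K = 512 codes (random stand-ins)
z_e = rng.standard_normal((4, 4, 8))       # hypothetical encoder output
idx, z_q = quantize(z_e, codebook)
```

Only the integer indices in `idx` are needed to describe the latent, which is what makes the representation categorical.
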


Although a VQ-VAE uses a discrete codebook of vectors, the number of possible images 
it can represent is still monstrously large. In some of their image experiments, they set the 
size of the codebook to K = 512 vectors and set the size of the latent variable to be z = 32 
× 32 × 1. Here, they can represent $512^{32 \cdot 32 \cdot 1}$ possible images. 
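
To get a feel for how large this number is, it can be evaluated exactly with Python's arbitrary-precision integers. The variable names are ours.

```python
import math

K = 512                      # codebook size
n_entries = 32 * 32 * 1      # number of entries in the latent variable
num_images = K ** n_entries  # exact count of distinct latent configurations

num_digits = len(str(num_images))        # number of decimal digits
log10_count = n_entries * math.log10(K)  # same information on a log scale
```

The count has 2775 decimal digits, i.e., it is on the order of 10^2774.
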


Compared to a VAE, which typically assumes a Gaussian latent distribution, the latent 
distribution of a VQ-VAE is not as clearly defined, so a separate generative model is trained 
to sample latent variables z. The model is trained on the final latent variables output by 
the trained VQ-VAE encoder across the training data. For images, entries in z are often 
spatially dependent, e.g., an object may be encoded over many neighboring entries. With 
entries being chosen from a discrete codebook of vectors, we can use a PixelCNN (van den 
Oord, Kalchbrenner et al. 2016) to autoregressively sample new entries in the latent variable 
based on previously sampled neighboring entries. The PixelCNN can also be conditionally 
trained, which makes it possible to sample latent variables corresponding to a particular 
image class or feature. 

A follow-up to the VQ-VAE model, named VQ-VAE-2 (Razavi, van den Oord, and 
Vinyals 2019), uses a two-level approach to decoding images: by combining a small and a 
large latent vector, it produces reconstructed and generated images of much higher fidelity. 
Section 6.6 discusses Dall-E (Ramesh, Pavlov et al. 2021), a model that applies VQ-VAE-2 
to text-to-image generation and achieves remarkable results. 


Generative adversarial networks 


Another possibility for image synthesis is to use the multi-resolution features computed by 
pre-trained networks to match the statistics of a given texture or style image, as described 
in Figure 10.57. While such networks are useful for matching the style of a given artist and 
the high-level content (layout) of a photograph, they are not sufficient to generate completely 
photorealistic images. 

In order to create truly photorealistic synthetic images, we want to determine if an image 
is real(istic) or fake. If such a loss function existed, we could use it to train networks to 
generate synthetic images. But, since such a loss function is incredibly difficult to write 
by hand, why not train a separate neural network to play the critic role? This is the main 
insight behind the generative adversarial networks introduced by Goodfellow, Pouget-Abadie 
et al. (2014). In their system, the output of the generator network G is fed into a separate 
discriminator network D, whose task is to tell “fake” synthetically generated images apart 
from real ones, as shown in Figure 5.59a. The goal of the generator is to create images that 
“fool” the discriminator into accepting them as real, while the goal of the discriminator is to 
catch the “forger” in their act. Both networks are co-trained simultaneously, using a blend 
of loss functions that encourage each network to do its job. The joint loss function can be 
written as 


$$E_{\mathrm{GAN}}(\mathbf{w}_G, \mathbf{w}_D) = \sum_n \log D(\mathbf{x}_n) + \log\left(1 - D(G(\mathbf{z}_n))\right), \tag{5.79}$$


where the $\{\mathbf{x}_n\}$ are the real-world training images, $\{\mathbf{z}_n\}$ are random vectors, which are 
passed through the generator G to produce synthetic images $\mathbf{x}'_n$, and the $\{\mathbf{w}_G, \mathbf{w}_D\}$ are the 
weights (parameters) in the generator and discriminator. 

Figure 5.59 Generative adversarial network (GAN) architectures from Pan, Yu et al. 
(2019) © 2019 IEEE. (a) In a regular GAN, random “latent” noise vectors z are fed into 
a generator network G, which produces synthetic “fake” images x' = G(z). The job of the 
discriminator D is to tell the fake images apart from real samples x. (b) In a conditional 
GAN (cGAN), the network iterates (during training) over all the classes that we wish to 
synthesize. The generator G gets both a class id c and a random noise vector z as input, and the 
discriminator D gets the class id as well and needs to determine if its input is a real member 
of the given class. (c) The discriminator in an InfoGAN does not have access to the class id, 
but must instead infer it from the samples it is given. 

Instead of minimizing this loss, we adjust the weights of the generator to minimize the 
second term (they do not affect the first), and adjust the weights of the discriminator to 
maximize both terms, i.e., minimize the discriminator’s error. This process is often called a 
minimax game. More details about the formulation and how to optimize it can be found in 
the original paper by Goodfellow, Pouget-Abadie et al. (2014), as well as deep learning 
textbooks (Zhang, Lipton et al. 2021, Chapter 17), lectures (Johnson 2020, Lecture 20), tutorials 
(Goodfellow, Isola et al. 2018), and review articles (Creswell, White et al. 2018; Pan, Yu et 
al. 2019). 
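
The two sides of the minimax game can be made concrete by evaluating (5.79) on hypothetical discriminator outputs. No networks are trained here; the probabilities below are made-up values, and the function names are ours.

```python
import numpy as np

def gan_objective(d_real, d_fake):
    """E_GAN = sum_n log D(x_n) + log(1 - D(G(z_n))).

    d_real, d_fake: discriminator outputs in (0, 1) on real and generated images.
    The discriminator tries to make this value large.
    """
    return np.sum(np.log(d_real) + np.log(1.0 - d_fake))

def generator_term(d_fake):
    """Second term only, which the generator tries to make small."""
    return np.sum(np.log(1.0 - d_fake))

# Hypothetical outputs for a batch of 4 real and 4 generated images.
confident = gan_objective(np.full(4, 0.9), np.full(4, 0.1))  # discriminator doing well
confused = gan_objective(np.full(4, 0.5), np.full(4, 0.5))   # discriminator at chance

fooled = generator_term(np.full(4, 0.9))  # fakes accepted as real
caught = generator_term(np.full(4, 0.1))  # fakes rejected
```

A discriminator that separates real from fake obtains a higher objective than one at chance, while a generator that fools the discriminator drives the second term down.
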

The original paper by Goodfellow, Pouget-Abadie et al. (2014) used a small, fully 
connected network to demonstrate the basic idea, so it could only generate 32 × 32 images 
such as MNIST digits and low-resolution faces. The Deep Convolutional GAN (DCGAN) 
introduced by Radford, Metz, and Chintala (2015) uses the second half of the deconvolution 
network shown in Figure 5.37a to map from the random latent vectors z to arbitrary size 
images and can therefore generate a much wider variety of outputs, while LAPGAN uses 
a Laplacian pyramid of adversarial networks (Denton, Chintala et al. 2015). Blending 
between different latent vectors (or perturbing them in certain directions) generates in-between 
synthetic images. 

60 Note that the term adversarial in GANs refers to this adversarial game between the generator and the 
discriminator, which helps the generator create better pictures. This is distinct from the adversarial examples we discussed 
in Section 5.4.6, which are images designed to fool recognition systems. 


GANs and DCGANs can be trained to generate new samples from a given class, but it is 
even more useful to generate samples from different classes using the same trained network. 
The conditional GAN (cGAN) proposed by Mirza and Osindero (2014) achieves this by feed- 
ing a class vector into both the generator, which conditions its output on this second input, as 
well as the discriminator, as shown in Figure 5.59b. It is also possible to make the discrim- 
inator predict classes that correlate with the class vector using an extra mutual information 
term, as shown in Figure 5.59c (Chen, Duan et al. 2016). This allows the resulting InfoGAN 
network to learn disentangled representations, such as the digit shapes and writing styles in 
MNIST, or pose and lighting. 


While generating random images can have many useful graphics applications, such as 
generating textures, filling holes, and stylizing photographs, as discussed in Section 10.5, 
it becomes even more useful when it can be done under a person’s artistic control (Lee, 
Zitnick, and Cohen 2011). The iGAN interactive image editing system developed by Zhu, 
Krähenbühl et al. (2016) does this by learning a manifold of photorealistic images using a 
generative adversarial network and then constraining user edits (or even sketches) to produce 
images that lie on this manifold. 


This approach was generalized by Isola, Zhu et al. (2017) to all kinds of other image-to- 
image translation tasks, as shown in Figure 5.60a. In their pix2pix system, images, which 
can just be sketches or semantic labels, are fed into a modified U-Net, which converts them 
to images with different semantic meanings or styles (e.g., photographs or road maps). When 
the input is a semantic label map and the output is a photorealistic image, this process is often 
called semantic image synthesis. The translation network is trained with a conditional GAN, 
which takes paired images from the two domains at training time and has the discriminator 
decide if the synthesized (translated) image together with the input image are a real or fake 
pair. Referring back to Figure 5.59b, the class c is now a complete image, which is fed 
into both G and the discriminator D, along with its paired or synthesized output. Instead of 
making a decision for the whole image, the discriminator looks at overlapping patches and 
makes decisions on a patch-by-patch basis, which requires fewer parameters and provides 
more training data and more discriminative feedback. In their implementation, there is no 
random vector z; instead, dropout is used during both training and “test” (translation) time, 
which is equivalent to injecting noise at different levels in the network. 


Figure 5.60 Image-to-image translation. (a) Given paired training images, the original 
pix2pix system learns how to turn sketches into photos, semantic maps to images, and other 
pixel remapping tasks (Isola, Zhu et al. 2017) © 2017 IEEE. (b) CycleGAN does not require 
paired training images, just collections coming from different sources, such as paintings and 
photographs or horses and zebras (Zhu, Park et al. 2017) © 2017 IEEE. 

In many situations, paired images are not available, e.g., when you have collections of 
paintings and photographs from different locations, or pictures of animals in two different 
classes, as shown in Figure 5.60b. In this case, a cycle-consistent adversarial network 
(CycleGAN) can be used to require that mapping an image to the other domain and back 
approximately reproduces the original image, while also ensuring that generated images are 
perceptually similar to the training images (Zhu, Park et al. 2017). DualGAN (Yi, Zhang et al. 
2017) and DiscoGAN (Kim, Cha et al. 2017) use related ideas. The BicycleGAN system of 
Zhu, Zhang et al. (2017) uses a similar idea of transformation cycles to encourage encoded 
latent vectors to correspond to different modes in the outputs for better interpretability and control. 
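
The cycle-consistency idea can be illustrated with a toy example that measures the L1 error after translating to the other domain and back. The two "networks" below are simple hand-written functions rather than learned models, and the adversarial terms of the full CycleGAN objective are omitted; all names are hypothetical.

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y):
    """Mean L1 error after mapping to the other domain and back again."""
    return np.abs(F(G(x)) - x).mean() + np.abs(G(F(y)) - y).mean()

# Toy stand-ins for the two translation networks (not learned).
G = lambda x: 2.0 * x + 1.0            # domain X -> domain Y
F_inverse = lambda y: (y - 1.0) / 2.0  # exact inverse of G
F_poor = lambda y: y                   # ignores the change of domain

rng = np.random.default_rng(0)
x, y = rng.random((8, 8)), rng.random((8, 8))
good = cycle_consistency_loss(G, F_inverse, x, y)  # close to zero
bad = cycle_consistency_loss(G, F_poor, x, y)      # large
```

Minimizing this term pushes the two mappings towards being inverses of each other, which is what allows training without paired images.
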

Since the publication of the original GAN paper, the number of extensions, applications, 
and follow-on papers has exploded. The GAN Zoo website⁶¹ lists over 500 GAN papers 
published between 2014 and mid-2018, at which point it stopped being updated. Large numbers 
of papers continue to appear each year in vision, machine learning, and graphics conferences. 

61 https://github.com/hindupuravinash/the-gan-zoo 

Figure 5.61 The Semantic Image Pyramid can be used to choose which semantic level in 
a deep network to modify when editing an image (Shocher, Gandelsman et al. 2020) © 2020 
IEEE. 

Some of the more important papers since 2017 include Wasserstein GANs (Arjovsky, 
Chintala, and Bottou 2017), Progressive GANs (Karras, Aila et al. 2018), UNIT (Liu, Breuel, 
and Kautz 2017) and MUNIT (Huang, Liu et al. 2018), spectral normalization (Miyato, 
Kataoka et al. 2018), SAGAN (Zhang, Goodfellow et al. 2019), BigGAN (Brock, Donahue, 
and Simonyan 2019), StarGAN (Choi, Choi et al. 2018) and StyleGAN (Karras, Laine, and 
Aila 2019) and follow-on papers (Choi, Uh et al. 2020; Karras, Laine et al. 2020; Viazovet- 
skyi, Ivashkin, and Kashin 2020), SPADE (Park, Liu et al. 2019), GANSpace (Härkönen, 
Hertzmann et al. 2020), and VQGAN (Esser, Rombach, and Ommer 2020). You can find 
more detailed explanations and references to many more papers in the lectures by John- 
son (2020, Lecture 20), tutorials by Goodfellow, Isola et al. (2018), and review articles by 
Creswell, White et al. (2018), Pan, Yu et al. (2019), and Tewari, Fried et al. (2020). 

In summary, generative adversarial networks and their myriad extensions continue to 
be an extremely vibrant and useful research area, with applications such as image super- 
resolution (Section 10.3), photorealistic image synthesis (Section 10.5.3), image-to-image 
translation, and interactive image editing. Two very recent examples of this last applica- 
tion are the Semantic Pyramid for Image Generation by Shocher, Gandelsman et al. (2020), 
in which the semantic manipulation level can be controlled (from small texture changes to 
higher-level layout changes), as shown in Figure 5.61, and the Swapping Autoencoder by 
Park, Zhu et al. (2020), where structure and texture can be independently edited. 
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5.6 Additional reading 


Machine learning and deep learning are rich, broad subjects which properly deserve their own 
course of study to master. Fortunately, there are a large number of good textbooks and online
courses available to learn this material.


My own favorite for machine learning is the book by Bishop (2006), since it provides 
a broad treatment with a Bayesian flavor and excellent figures, which I have re-used in this 
book. The books by Glassner (2018, 2021) provide an even gentler introduction to both 
classic machine learning and deep learning, as well as additional figures I reference in this 
book. Two additional widely used textbooks for machine learning are Hastie, Tibshirani, 
and Friedman (2009) and Murphy (2012). Deisenroth, Faisal, and Ong (2020) provide a 
nice compact treatment of mathematics for machine learning, including linear and matrix 
algebra, probability theory, model fitting, regression, PCA, and SVMs, with a more in-depth 
exposition than the terse summaries I provide in this book. The book on Automated Machine 
Learning edited by Hutter, Kotthoff, and Vanschoren (2019) surveys automated techniques 
for designing and optimizing machine learning algorithms. 


For deep learning, Goodfellow, Bengio, and Courville (2016) were the first to provide a 
comprehensive treatment, but it has not recently been revised. Glassner (2018, 2021) provides 
a wonderful introduction to deep learning, with lots of figures and no equations. I recommend 
it even to experienced practitioners since it helps develop and solidify intuitions about how 
learning works. An up-to-date reference on deep learning is the Dive into Deep Learning on- 
line textbook by Zhang, Lipton et al. (2021), which comes with interactive Python notebooks 
sprinkled throughout the text, as well as an associated course (Smola and Li 2019). Some 
introductory courses to deep learning use Charniak (2019). 


Rawat and Wang (2017) provide a nice review article on deep learning, including a history 
of early and later neural networks, as well as in-depth discussion of many deep learning
components, such as pooling, activation functions, losses, regularization, and optimization.
Additional surveys related to advances in deep learning include Sze, Chen et al. (2017), Elsken,
Metzen, and Hutter (2019), Gu, Wang et al. (2018), and Choudhary, Mishra et al. (2020). 
Sejnowski (2018) provides an in-depth history of the early days of neural networks. 


The Deep Learning for Computer Vision course slides by Johnson (2020) are an outstand- 
ing reference and a great way to learn the material, both for the depth of their information 
and how up-to-date the presentations are kept. They are based on Stanford’s CS231n course 
(Li, Johnson, and Yeung 2019), which is also a great up-to-date source. Additional classes 
on deep learning with slides and/or video lectures include Grosse and Ba (2019), McAllester 
(2020), Leal-Taixé and Nießner (2020), Leal-Taixé and Nießner (2021), and Geiger (2021).
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For transformers, Bloem (2019) provides a nice starting tutorial on implementing the stan- 
dard transformer encoder and decoder block in PyTorch, from scratch. More comprehensive 
surveys of transformers applied to computer vision include Khan, Naseer et al. (2021) and 
Han, Wang et al. (2020). Tay, Dehghani et al. (2020) provide an overview of many attempts
to reduce the quadratic cost of self-attention. Wightman (2021) makes available a fantastic col- 
lection of computer vision transformer implementations in PyTorch, with pre-trained weights 
and great documentation. Additional course lectures introducing transformers with videos 
and slides include Johnson (2020, Lecture 13), Vaswani, Huang, and Manning (2019, Lec- 
ture 14) and LeCun and Canziani (2020, Week 12). 

For GANs, the new deep learning textbook by Zhang, Lipton et al. (2021, Chapter 17), 
lectures by Johnson (2020, Lecture 20), tutorials by Goodfellow, Isola et al. (2018), and 
review articles by Creswell, White et al. (2018), Pan, Yu et al. (2019), and Tewari, Fried 
et al. (2020) are all good sources. For a survey of the latest visual recognition techniques, 
the tutorials presented at ICCV (Xie, Girshick et al. 2019), CVPR (Girshick, Kirillov et al. 
2020), and ECCV (Xie, Girshick et al. 2020) are excellent up-to-date sources. 


5.7 Exercises 


Ex 5.1: Backpropagation and weight updates. Implement the forward activation, back- 
ward gradient and error propagation, and weight update steps in a simple neural network. 
You can find examples of such code in HW3 of the 2020 UW CSE 576 class⁶² or the Educational
Framework (EDF) developed by McAllester (2020) and used in Geiger (2021).


Ex 5.2: LeNet. Download, train, and test a simple “LeNet” (LeCun, Bottou et al. 1998)
convolutional neural network on the CIFAR-10 (Krizhevsky 2009) or Fashion MNIST (Xiao, 
Rasul, and Vollgraf 2017) datasets. You can find such code in numerous places on the web, 
including HW4 of the 2020 UW CSE 576 class or the PyTorch beginner tutorial on Neural 
Networks.⁶³

Modify the network to remove the non-linearities. How does the performance change? 
Can you improve the performance of the original network by increasing the number of channels,
layers, or convolution sizes? Do the training and testing accuracies move in the same or
different directions as you modify your network?


Ex 5.3: Deep learning textbooks. Both the Deep Learning: From Basics to Practice book 
by Glassner (2018, Chapters 15, 23, and 24) and the Dive into Deep Learning book by Zhang, 


62 https://courses.cs.washington.edu/courses/cse576/20sp/calendar/
63 https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html 
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Figure 5.62 Simple two hidden unit network with a ReLU activation function and no bias 
parameters for regressing the function $y = |x_1 + 1.1\,x_2|$: (a) can you guess a set of weights
that would fit this function?; (b) a reasonable set of starting weights; (c) a poorly scaled set 
of weights. 


Lipton et al. (2021) contain myriad graded exercises with code samples to develop your 
understanding of deep neural networks. If you have the time, try to work through most of 
these. 


Ex 5.4: Activation and weight scaling. Consider the two hidden unit network shown in 
Figure 5.62, which uses ReLU activation functions and has no additive bias parameters. Your
task is to find a set of weights that will fit the function

$$y = |x_1 + 1.1\,x_2|. \tag{5.80}$$
1. Can you guess a set of weights that will fit this function? 


2. Starting with the weights shown in column (b), compute the activations for the hidden
and final units as well as the regression loss for the nine input values
$(x_1, x_2) \in \{-1, 0, 1\} \times \{-1, 0, 1\}$.


3. Now compute the gradients of the squared loss with respect to all six weights using the
backpropagation chain rule equations (5.65–5.68) and sum them up across the training
samples to get a final gradient.


4. What step size should you take in the gradient direction, and what would your updated
squared loss become?
5. Repeat this exercise for the initial weights in column (c) of Figure 5.62. 


6. Given this new set of weights, how much worse is your error decrease, and how many
iterations would you expect it to take to achieve a reasonable solution?


Figure 5.63 Function optimization: (a) the contour plot of $f(x, y) = x^2 + 20y^2$ with
the function being minimized at (0, 0); (b) ideal gradient descent optimization that quickly
converges towards the minimum at x = 0, y = 0.


7. Would batch normalization help in this case? 
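
As a starting point for Ex 5.4, the forward pass and loss evaluation can be sketched as follows. This is only a minimal sketch under stated assumptions: the weights `W` and `v` below are hypothetical placeholders (they are not the values from Figure 5.62), and the loss is taken to be the plain sum of squared errors over the nine inputs, without a factor of one half.

```python
from itertools import product


def relu(value):
    return max(0.0, value)


def forward(x1, x2, W, v):
    """Two hidden ReLU units without biases: h = relu(W x), y_hat = v . h."""
    h = [relu(W[j][0] * x1 + W[j][1] * x2) for j in range(2)]
    y_hat = v[0] * h[0] + v[1] * h[1]
    return h, y_hat


def total_squared_loss(W, v):
    """Sum of squared errors over the 3 x 3 grid of inputs in {-1, 0, 1}^2."""
    loss = 0.0
    for x1, x2 in product([-1, 0, 1], repeat=2):
        _, y_hat = forward(x1, x2, W, v)
        target = abs(x1 + 1.1 * x2)
        loss += (y_hat - target) ** 2
    return loss


# Hypothetical starting weights (NOT the ones shown in Figure 5.62).
W = [[0.5, 0.5], [-0.5, 0.5]]  # input-to-hidden weights (four values)
v = [1.0, 1.0]                 # hidden-to-output weights (two values)
print(total_squared_loss(W, v))
```

Substituting the weights from columns (b) and (c) of the figure lets you compare the two starting losses before working through the gradients by hand.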
Note: the following exercises were suggested by Matt Deitke. 


Ex 5.5: Function optimization. Consider the function $f(x, y) = x^2 + 20y^2$ shown in
Figure 5.63a. Begin by solving for the following:


1. Calculate ∇f, i.e., the gradient of f.
2. Evaluate the gradient at x = −20, y = 5.


Implement some of the common gradient descent optimizers, which should take you from 
the starting point x = −20, y = 5 to near the minimum at x = 0, y = 0. Try each of the
following optimizers: 


1. Standard gradient descent.
2. Gradient descent with momentum, starting with the momentum term as ρ = 0.99.
3. Adam, starting with decay rates of $\beta_1 = 0.9$ and $\beta_2 = 0.999$.


Play around with the learning rate α. For each experiment, plot how x and y change over
time, as shown in Figure 5.63b. 

How do the optimizers behave differently? Is there a single learning rate that makes all 
the optimizers converge towards x = 0, y = 0 in under 200 steps? Does each optimizer
monotonically trend towards x = 0, y = 0?
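
One possible scaffold for the three optimizers in Ex 5.5 is sketched below in plain Python. The update rules are common textbook forms and are assumptions rather than the only valid choices: momentum accumulates a velocity as `v = rho * v + g`, and Adam uses bias-corrected moment estimates with a small `eps`. The learning rates in the demo loop are arbitrary starting values to experiment with, not recommended settings.

```python
import math


def grad(x, y):
    """Gradient of f(x, y) = x**2 + 20 * y**2."""
    return 2.0 * x, 40.0 * y


def gradient_descent(start, lr, steps):
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


def momentum(start, lr, steps, rho=0.99):
    x, y = start
    vx = vy = 0.0
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx, vy = rho * vx + gx, rho * vy + gy  # accumulate velocity
        x, y = x - lr * vx, y - lr * vy
        path.append((x, y))
    return path


def adam(start, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    x, y = start
    m = [0.0, 0.0]  # first-moment (mean) estimates
    s = [0.0, 0.0]  # second-moment (uncentered variance) estimates
    path = [(x, y)]
    for t in range(1, steps + 1):
        g = grad(x, y)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]  # bias correction
        s_hat = [si / (1 - beta2 ** t) for si in s]
        x -= lr * m_hat[0] / (math.sqrt(s_hat[0]) + eps)
        y -= lr * m_hat[1] / (math.sqrt(s_hat[1]) + eps)
        path.append((x, y))
    return path


start = (-20.0, 5.0)
for name, optimizer, lr in [("gd", gradient_descent, 0.04),
                            ("momentum", momentum, 0.001),
                            ("adam", adam, 1.0)]:
    x, y = optimizer(start, lr, 200)[-1]
    print(f"{name:9s} lr={lr}: x={x:.4f}, y={y:.4f}")
```

Each function returns the full path of (x, y) iterates, which can be plotted over the contours of f to compare how the optimizers approach the minimum.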


Ex 5.6: Weight initialization. For an arbitrary neural network, is it possible to initialize 
the weights of a neural network such that it will never train on any non-trivial task, such as
image classification or object detection? Explain why or why not.
Ex 5.7: Convolutions. Consider convolving a 256 × 256 × 3 image with 64 separate convolution
kernels. For kernels with heights and widths of {(3 × 3), (5 × 5), (7 × 7), and (9 × 9)},
answer each of the following:
1. How many parameters (i.e., weights) make up the convolution operation? 


2. What is the output size after convolving the image with the kernels? 
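
The bookkeeping for Ex 5.7 can be checked with a small helper such as the one below. It is a sketch that rests on several assumptions not fixed by the exercise: each kernel spans all three input channels, each kernel has one optional bias term, and the defaults are a stride of 1 with no padding. Other conventions (for example, "same" padding) give different output sizes.

```python
def conv_params(kernel, in_channels=3, out_channels=64, bias=True):
    """Number of learnable weights in a square-kernel convolution layer."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


for k in (3, 5, 7, 9):
    n = conv_output_size(256, k)
    print(f"{k} x {k}: parameters={conv_params(k)}, output={n} x {n} x 64")
```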


Ex 5.8: Data augmentation. The figure below shows image augmentations that translate
and scale an image.

[Figure: image augmentations that translate and scale an image.]


Let CONV denote a convolution operation, f denote an arbitrary function (such as scaling or
translating an image), and IMAGE denote the input image. A function f has invariance, with
respect to a convolution, when CONV(IMAGE) = CONV(f(IMAGE)), and equivariance when
CONV(f(IMAGE)) = f(CONV(IMAGE)). Answer and explain each of the following:


1. Are convolutions translation invariant? 
2. Are convolutions translation equivariant? 
3. Are convolutions scale invariant? 


4. Are convolutions scale equivariant? 
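
The two definitions above can be probed numerically before you reason about them in general. The sketch below is a single 1D experiment, not a proof, and it makes two simplifying assumptions: the "image" is a short list of numbers, and the filter wraps around at the borders (circular boundary handling), so that a translation is an exact cyclic shift. The signal and kernel values are arbitrary.

```python
def correlate_circular(signal, kernel):
    """1D filtering (cross-correlation) with wrap-around borders."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
            for i in range(n)]


def shift(signal, t):
    """Cyclically translate the signal by t samples."""
    n = len(signal)
    return [signal[(i - t) % n] for i in range(n)]


signal = [0, 1, 3, 2, 0, 0, 5, 1]
kernel = [1, -2, 1]

a = correlate_circular(shift(signal, 2), kernel)  # CONV(f(IMAGE))
b = shift(correlate_circular(signal, kernel), 2)  # f(CONV(IMAGE))
c = correlate_circular(signal, kernel)            # CONV(IMAGE)

print("CONV(f(IMAGE)) == f(CONV(IMAGE)):", a == b)
print("CONV(f(IMAGE)) == CONV(IMAGE):   ", a == c)
```

Replacing `shift` with a resampling (scaling) function lets you run the same two comparisons for the scale questions.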


Ex 5.9: Training vs. validation. Suppose your model is performing significantly better on 
the training data than it is on the validation data. What changes might be made to the loss
function, training data, and network architecture to prevent such overfitting?


Ex 5.10: Cascaded convolutions. With only a single matrix multiplication, how can multiple
convolutional kernels convolve over an entire input image?⁶⁴ Here, let the input image be
of size 256 × 256 × 3 and each of the 64 kernels be of size 3 × 3 × 3.
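
One way to explore Ex 5.10 is the patch-unrolling idea often called im2col, sketched below in plain Python on a tiny input. The function names (`im2col`, `conv_as_matmul`) are illustrative, and the sketch assumes a stride of 1, no padding, and the cross-correlation convention used by most deep learning libraries. For the sizes in the exercise, the same construction would give a 64,516 × 27 patch matrix (254 · 254 positions, 3 · 3 · 3 values each) multiplied by a 27 × 64 weight matrix.

```python
def im2col(image, k):
    """Unroll every k x k x C patch of image[y][x][c] into one row."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    rows = []
    for y in range(height - k + 1):
        for x in range(width - k + 1):
            rows.append([image[y + dy][x + dx][c]
                         for dy in range(k)
                         for dx in range(k)
                         for c in range(channels)])
    return rows


def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def conv_as_matmul(image, kernels):
    """kernels[n][dy][dx][c] -> output[y][x][n] via one matrix product."""
    k = len(kernels[0])
    height, width = len(image), len(image[0])
    patches = im2col(image, k)  # (H' * W') x (k * k * C)
    # Flatten each kernel in the same (dy, dx, c) order as the patches.
    flat = [[w for row in kern for pixel in row for w in pixel]
            for kern in kernels]  # N x (k * k * C)
    weight_matrix = [list(col) for col in zip(*flat)]  # (k * k * C) x N
    out = matmul(patches, weight_matrix)  # (H' * W') x N
    out_width = width - k + 1
    return [out[r * out_width:(r + 1) * out_width]
            for r in range(height - k + 1)]


# Tiny demo: a 4 x 5 x 3 image and two 3 x 3 x 3 kernels.
demo_image = [[[(y + 2 * x + c) % 5 for c in range(3)] for x in range(5)]
              for y in range(4)]
demo_kernels = [[[[1 if (dy + dx + c + n) % 2 else -1 for c in range(3)]
                  for dx in range(3)] for dy in range(3)] for n in range(2)]
result = conv_as_matmul(demo_image, demo_kernels)
print(len(result), len(result[0]), len(result[0][0]))  # output H', W', N
```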


Ex 5.11: Pooling vs. 1 × 1 convolutions. Pooling layers and 1 × 1 convolutions are both
commonly used to shrink the size of the proceeding layer. When would you use one over the
other?


64 Hint: You will need to reshape the input and each convolution’s kernel size.
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Ex 5.12: Inception. Why is an inception module more efficient than a residual block? 


What are the comparative disadvantages of using an inception module? 


Ex 5.13: ResNets. Why is it easier to train a ResNet with 100 layers than a VGG network 
with 100 layers? 


Ex 5.14: U-Nets. An alternative to the U-Net architecture is to not change the height and
width of the intermediate activations throughout the network. The final layer would then
be able to output the same transformed pixel-wise representation of the input image. What is
the disadvantage of this approach?


Ex 5.15: Early vs. late fusion in video processing. What are two advantages of early fusion
compared to late fusion?


Ex 5.16: Video-to-video translation. Independently pass each frame in a video through a
pix2pix model. For instance, if the video was shot during the day, then the output might be
each frame rendered at night. Stitch the output frames together to form a video. What do you
notice? Does the video look plausible?


Ex 5.17: Vision Transformer. Using a Vision Transformer (ViT) model, pass several images
through it and create a histogram of the activations after each layer normalization operation.
Do the histograms tend to form a normal distribution?


Ex 5.18: GAN training. In the GAN loss formulation, suppose the discriminator D is near-perfect,
such that it correctly outputs near 1 for real images $\mathbf{x}_n$ and near 0 for synthetically
generated images $G(\mathbf{z}_n)$.


1. For both the discriminator and the generator, compute its approximate loss with

$$\mathcal{L}_{\text{GAN}}(\mathbf{x}_n, \mathbf{z}_n) = \log D(\mathbf{x}_n) + \log(1 - D(G(\mathbf{z}_n))), \tag{5.81}$$


where the discriminator tries to maximize $\mathcal{L}_{\text{GAN}}$ and the generator tries to
minimize $\mathcal{L}_{\text{GAN}}$.
2. How well can this discriminator be used to train the generator? 


3. Can you modify the generator’s loss function, $\min \log(1 - D(G(\mathbf{z}_n)))$, such that it is
easier to train with both a great discriminator and a discriminator that is no better than
random?⁶⁵


65 Hint: The loss function should suggest a relatively large change to fool a great discriminator and a relatively
small change with a discriminator that is no better than random.
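
To experiment with Ex 5.18 numerically, the short sketch below evaluates the loss in (5.81) and compares the gradient signal that two candidate generator losses send back through the discriminator. It rests on assumptions that are not part of the exercise statement: the discriminator output is modeled as a sigmoid of a scalar logit `a`, gradients are taken with respect to that logit, and the alternative loss shown, −log D(G(z)), is one commonly used variant rather than the only possible answer.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def loss_gan(d_real, d_fake):
    """Equation (5.81) for one real sample and one generated sample."""
    return math.log(d_real) + math.log(1.0 - d_fake)


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the generator loss given in item 3."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), a commonly used alternative."""
    return -(1.0 - sigmoid(a))


print("loss with a near-perfect D:", loss_gan(0.99, 0.01))
for a in (-5.0, 0.0):  # confident "fake" verdict vs. a coin-flip verdict
    print(f"logit={a:+.1f}  D(G(z))={sigmoid(a):.4f}  "
          f"saturating={grad_saturating(a):+.4f}  "
          f"non-saturating={grad_non_saturating(a):+.4f}")
```

Comparing the two gradient columns at the confident and the coin-flip logits is a quick way to test any candidate loss against the behavior described in the hint.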
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Ex 5.19: Colorization. Even though large amounts of unsupervised data can be collected 
for image colorization, it often does not train well using a pixel-wise regression loss between 
an image’s predicted colors and its true colors. Why is that? Is there another loss function
that may be better suited for the problem?


Figure 6.1 Various kinds of recognition: (a) face recognition with pictorial structures 
(Fischler and Elschlager 1973) © 1973 IEEE; (b) instance (known object) recognition (Lowe
1999) © 1999 IEEE; (c) real-time face detection (Viola and Jones 2004) © 2004 Springer;
(d) feature-based recognition (Fergus, Perona, and Zisserman 2007) © 2007 Springer; (e)
instance segmentation using Mask R-CNN (He, Gkioxari et al. 2017) © 2017 IEEE; (f) pose
estimation (Güler, Neverova, and Kokkinos 2018) © 2018 IEEE; (g) panoptic segmentation
(Kirillov, He et al. 2019) © 2019 IEEE; (h) video action recognition (Feichtenhofer, Fan et
al. 2019); (i) image captioning (Lu, Yang et al. 2018) © 2018 IEEE.
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Of all the computer vision topics covered in this book, visual recognition has undergone 
the largest changes and fastest development in the last decade, due in part to the availability 
of much larger labeled datasets as well as breakthroughs in deep learning (Figure 5.40). In 
the first edition of this book (Szeliski 2010), recognition was the last chapter, since it was 
considered a “high-level task” to be layered on top of lower-level components such as feature 
detection and matching. In fact, many introductory vision courses still teach recognition at 
the end, often covering “classic” (non-learning) vision algorithms and applications first, and 
then shifting to deep learning and recognition. 


As I mentioned in the preface and introduction, I have now moved machine and deep 
learning to early in the book, since it is foundational technology widely used in other parts 
of computer vision. I also decided to move the recognition chapter right after deep learning, 
since most of the modern techniques for recognition are natural applications of deep neural 
networks. The majority of the old recognition chapter has been replaced with newer deep 
learning techniques, so you will sometimes find terse descriptions of classical recognition
techniques along with pointers to the first edition and relevant surveys or seminal papers.


A good example of the classic approach is instance recognition, where we are trying 
to find exemplars of a particular manufactured object such as a stop sign or sneaker (Fig- 
ure 6.1b). (An even earlier example is face recognition using relative feature locations, as 
shown in Figure 6.1a.) The general approach of finding distinctive features while dealing 
with local appearance variation (Section 7.1.2), and then checking for their co-occurrence 
and relative positions in an image, is still widely used for manufactured 3D object detection 
(Figure 6.3), 3D structure and pose recovery (Chapter 11), and location recognition (Sec- 
tion 11.2.3). Highly accurate and widely used feature-based approaches to instance recogni- 
tion were developed in the 2000s (Figure 7.27) and, despite more recent deep learning-based 
alternatives, are often still the preferred method (Sattler, Zhou et al. 2019). We review in- 
stance recognition in Section 6.1, although some of the needed components, such as feature 
detection, description, and matching (Chapter 7), as well as 3D pose estimation and verification
(Chapter 11), will not be introduced until later.


The more difficult problem of category or class recognition (e.g., recognizing members 
of highly variable categories such as cats, dogs, or motorcycles) was also initially attacked 
using feature-based approaches and relative locations (part-based models), such as the one 
depicted in Figure 6.1d. We begin our discussion of image classification (another name for 
whole-image category recognition) in Section 6.2 with a review of such “classic” (though 
now rarely used) techniques. We then show how the deep neural networks described in the 
previous chapter are ideally suited to these kinds of classification problems. Next, we cover
visual similarity search, where instead of categorizing an image into a predefined number of
categories, we retrieve other images that are semantically similar. Finally, we focus on face
recognition, which is one of the longest studied topics in computer vision.


In Section 6.3, we turn to the topic of object detection, where we categorize not just whole 
images but delineate (with bounding boxes) where various objects are located. This topic 
includes more specialized variants such as face detection and pedestrian detection, as well as 
the detection of objects in generic categories. In Section 6.4, we study semantic segmentation, 
where the task is now to delineate various objects and materials in a pixel-accurate manner, 
i.e., to label each pixel with an object identity and class. Variants on this include instance 
segmentation, where each separate object gets a unique label, panoptic segmentation, where 
both objects and stuff (e.g., grass, sky) get labeled, and pose estimation, where pixels get 
labeled with people’s body parts and orientations. The last two sections of this chapter briefly
touch on video understanding (Section 6.5) and vision and language (Section 6.6).


Before starting to describe individual recognition algorithms and variants, I should briefly 
mention the critical role that large-scale datasets and benchmarks have played in the rapid ad- 
vancement of recognition systems. While small datasets such as Xerox 10 (Csurka, Dance et 
al. 2006) and Caltech-101 (Fei-Fei, Fergus, and Perona 2006) played an early role in evaluat- 
ing object recognition systems, the PASCAL Visual Object Class (VOC) challenge (Evering- 
ham, Van Gool et al. 2010; Everingham, Eslami et al. 2015) was the first dataset large and 
challenging enough to significantly propel the field forward. However, PASCAL VOC only 
contained 20 classes. The introduction of the ImageNet dataset (Deng, Dong et al. 2009; Rus- 
sakovsky, Deng et al. 2015), which had 1,000 classes and over one million labeled images, 
finally provided enough data to enable end-to-end learning systems to break through. The 
Microsoft COCO (Common Objects in Context) dataset spurred further development (Lin, 
Maire et al. 2014), especially in accurate per-object segmentation, which we study in Section 6.4.
A nice review of crowdsourcing methods to construct such datasets is presented by
Kovashka, Russakovsky et al. (2016). We will mention additional, sometimes more specialized,
datasets throughout this chapter. A listing of the most popular and active datasets and
benchmarks is provided in Tables 6.1–6.4.


6.1 Instance recognition 


General object recognition falls into two broad categories, namely instance recognition and 
class recognition. The former involves re-recognizing a known 2D or 3D rigid object, potentially
being viewed from a novel viewpoint, against a cluttered background, and with partial
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Figure 6.2 Recognizing objects in a cluttered scene (Lowe 2004) © 2004 Springer. Two of 
the training images in the database are shown on the left. They are matched to the cluttered 
scene in the middle using SIFT features, shown as small squares in the right image. The affine 
warp of each recognized database image onto the scene is shown as a larger parallelogram
in the right image.


occlusions.¹ The latter, which is also known as category-level or generic object recognition
(Ponce, Hebert et al. 2006), is the much more challenging problem of recognizing any
instance of a particular general class, such as “cat”, “car”, or “bicycle”. 

Over the years, many different algorithms have been developed for instance recognition. 
Mundy (2006) surveys earlier approaches, which focused on extracting lines, contours, or 
3D surfaces from images and matching them to known 3D object models. Another popu- 
lar approach was to acquire images from a large set of viewpoints and illuminations and to 
represent them using an eigenspace decomposition (Murase and Nayar 1995). More recent 
approaches (Lowe 2004; Lepetit and Fua 2005; Rothganger, Lazebnik et al. 2006; Ferrari, 
Tuytelaars, and Van Gool 2006b; Gordon and Lowe 2006; Obdržálek and Matas 2006; Sivic 
and Zisserman 2009; Zheng, Yang, and Tian 2018) tend to use viewpoint-invariant 2D fea- 
tures, such as those we will discuss in Section 7.1.2. After extracting informative sparse 2D 
features from both the new image and the images in the database, image features are matched 
against the object database, using one of the sparse feature matching strategies described in 
Section 7.1.3. Whenever a sufficient number of matches have been found, they are verified 
by finding a geometric transformation that aligns the two sets of features (Figure 6.2). 


1 The Microsoft COCO dataset paper (Lin, Maire et al. 2014) introduced the newer concept of instance
segmentation, which is the pixel-accurate delineation of different objects drawn from a set of generic classes
(Section 6.4.2). This now sometimes leads to confusion, unless you look at these two terms (instance
recognition vs. segmentation) carefully.
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(d) 


Figure 6.3 3D object recognition with affine regions (Rothganger, Lazebnik et al. 2006) ©
2006 Springer: (a) sample input image; (b) five of the recognized (reprojected) objects along
with their bounding boxes; (c) a few of the local affine regions; (d) local affine region (patch)
reprojected into a canonical (square) frame, along with its geometric affine transformations.


Geometric alignment 


To recognize one or more instances of some known objects, such as those shown in the left 
column of Figure 6.2, the recognition system first extracts a set of interest points in each 
database image and stores the associated descriptors (and original positions) in an indexing 
structure such as a search tree (Section 7.1.3). At recognition time, features are extracted 
from the new image and compared against the stored object features. Whenever a sufficient 
number of matching features (say, three or more) are found for a given object, the system then 
invokes a match verification stage, whose job is to determine whether the spatial arrangement 
of matching features is consistent with those in the database image. 

Because images can be highly cluttered and similar features may belong to several objects, 
the original set of feature matches can have a large number of outliers. For this reason, Lowe 
(2004) suggests using a Hough transform (Section 7.4.2) to accumulate votes for likely geo- 
metric transformations. In his system, he uses an affine transformation between the database 
object and the collection of scene features, which works well for objects that are mostly planar,
or where at least several corresponding features share a quasi-planar geometry.²
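
The voting idea can be illustrated with a deliberately simplified sketch. Instead of the affine model described above, the code below lets each tentative match vote for a quantized 2D translation between the database image and the scene, and accepts the most popular bin only if it collects at least three votes (the threshold mentioned earlier in this section). The function names, the bin size, and the translation-only model are all assumptions made for illustration; a practical system would vote over a richer pose space, spread votes over neighboring bins, and then fit and verify an affine transformation on the winning matches.

```python
from collections import Counter


def hough_translation_votes(matches, bin_size=10.0):
    """Each match ((xm, ym), (xs, ys)) votes for a quantized offset."""
    votes = Counter()
    members = {}
    for (xm, ym), (xs, ys) in matches:
        key = (round((xs - xm) / bin_size), round((ys - ym) / bin_size))
        votes[key] += 1
        members.setdefault(key, []).append(((xm, ym), (xs, ys)))
    return votes, members


def best_hypothesis(matches, min_votes=3, bin_size=10.0):
    """Return the matches in the winning bin, or None if support is weak."""
    votes, members = hough_translation_votes(matches, bin_size)
    if not votes:
        return None
    key, count = votes.most_common(1)[0]
    return members[key] if count >= min_votes else None


# Hypothetical matches: four agree on an offset near (100, 50), two do not.
matches = [((10.0, 10.0), (110.0, 60.0)), ((20.0, 15.0), (121.0, 64.0)),
           ((5.0, 30.0), (104.0, 81.0)), ((40.0, 8.0), (141.0, 57.0)),
           ((10.0, 10.0), (-20.0, 210.0)), ((30.0, 30.0), (280.0, -10.0))]
consistent = best_hypothesis(matches)
print(len(consistent) if consistent else 0, "matches support the best offset")
```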

Another system that uses local affine frames is the one developed by Rothganger, Lazebnik
et al. (2006). In their system, the affine region detector of Mikolajczyk and Schmid
(2004) is used to rectify local image patches (Figure 6.3d), from which both a SIFT descriptor
and a 10 × 10 UV color histogram are computed and used for matching and recognition.
Corresponding patches in different views of the same object, along with their local affine 


deformations, are used to compute a 3D affine model for the object using an extension of 


2 When a larger number of features is available, a full fundamental matrix can be used (Brown and Lowe 2002;
Gordon and Lowe 2006). When image stitching is being performed (Brown and Lowe 2007), the motion models
discussed in Section 8.2.1 can be used instead.
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the factorization algorithm of Section 11.4.1, which can then be upgraded to a Euclidean re- 
construction (Tomasi and Kanade 1992). At recognition time, local Euclidean neighborhood 
constraints are used to filter potential matches, in a manner analogous to the affine geometric 
constraints used by Lowe (2004) and Obdržálek and Matas (2006). Figure 6.3 shows the 
results of recognizing five objects in a cluttered scene using this approach. 

While feature-based approaches are normally used to detect and localize known objects 
in scenes, it is also possible to get pixel-level segmentations of the scene based on such 
matches. Ferrari, Tuytelaars, and Van Gool (2006b) describe such a system for simultane- 
ously recognizing objects and segmenting scenes, while Kannala, Rahtu et al. (2008) extend 
this approach to non-rigid deformations. Section 6.4 re-visits this topic of joint recognition 
and segmentation in the context of generic class (category) recognition. 

While instance recognition in the early to mid-2000s focused on the problem of locating 
a known 3D object in an image, as shown in Figures 6.2–6.3, attention has since shifted to the more
challenging problem of instance retrieval (also known as content-based image retrieval), in 
which the number of images being searched can be very large. Section 7.1.4 reviews such 
techniques, a snapshot of which can be seen in Figure 7.27 and the survey by Zheng, Yang, 
and Tian (2018). This topic is also related to visual similarity search (Section 6.2.3) and 3D
pose estimation (Section 11.2).


6.2 Image classification 


While instance recognition techniques are relatively mature and are used in commercial appli- 
cations such as traffic sign recognition (Stallkamp, Schlipsing et al. 2012), generic category 
(class) recognition is still a rapidly evolving research area. Consider for example the set of 
photographs in Figure 6.4a, which shows objects taken from 10 different visual categories. 
(I'll leave it up to you to name each of the categories.) How would you go about writing a 
program to categorize each of these images into the appropriate class, especially if you were 
also given the choice “none of the above”? 

As you can tell from this example, visual category recognition is an extremely challenging 
problem. However, the progress in the field has been quite dramatic, if judged by how much 
better today’s algorithms are compared to those of a decade ago. 

In this section, we review the main classes of algorithms used for whole-image classifi- 
cation. We begin with classic feature-based approaches that rely on handcrafted features and 
their statistics, optionally using machine learning to do the final classification (Figure 5.2b). 
Since such techniques are no longer widely used, we present a fairly terse description of
the most important techniques. More details can be found in the first edition of this book
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Figure 6.4 Challenges in image recognition: (a) sample images from the Xerox 10 class 
dataset (Csurka, Dance et al. 2006) © 2007 Springer; (b) axes of difficulty and variation from
the ImageNet dataset (Russakovsky, Deng et al. 2015) © 2015 Springer.


(Szeliski 2010, Chapter 14) and in the cited journal papers and surveys. Next, we describe 
modern image classification systems, which are based on the deep neural networks we intro- 
duced in the previous chapter. We then describe visual similarity search, where the task is 
to find visually and semantically similar images, rather than classification into a fixed set of 
categories. Finally, we look at face recognition, since this topic has its own long history and
set of techniques. 


6.2.1 Feature-based methods 


In this section, we review “classic” feature-based approaches to category recognition (image
classification). While, historically, part-based representations and recognition algorithms
(Section 6.2.1) were the preferred approach (Fischler and Elschlager 1973; Felzenszwalb
and Huttenlocher 2005; Fergus, Perona, and Zisserman 2007), we begin by describing simpler
bag-of-features approaches that represent objects and images as unordered collections
of feature descriptors. We then review more complex systems constructed with part-based
models, and then look at how context and scene understanding, as well as machine learning,
can improve overall recognition results. Additional details on the techniques presented in
this section can be found in older survey articles, paper collections, and courses (Pinz 2005;
Ponce, Hebert et al. 2006; Dickinson, Leonardis et al. 2007; Fei-Fei, Fergus, and Torralba
2009), as well as two review articles on the PASCAL and ImageNet recognition challenges
(Everingham, Van Gool et al. 2010; Everingham, Eslami et al. 2015; Russakovsky, Deng et
al. 2015) and the first edition of this book (Szeliski 2010, Chapter 14).

Figure 6.5 Sample images from two widely used image classification datasets: (a) Pascal
Visual Object Categories (VOC) (Everingham, Eslami et al. 2015) © 2015 Springer; (b)
ImageNet (Russakovsky, Deng et al. 2015) © 2015 Springer.

Figure 6.6 A typical processing pipeline for a bag-of-words category recognition system
(Csurka, Dance et al. 2006) © 2007 Springer. Features are first extracted at keypoints and
then quantized to get a distribution (histogram) over the learned visual words (feature cluster
centers). The feature distribution histogram is used to learn a decision surface using a
classification algorithm, such as a support vector machine.


Bag of words 


One of the simplest algorithms for category recognition is the bag of words (also known as 
bag of features or bag of keypoints) approach (Csurka, Dance et al. 2004; Lazebnik, Schmid, 
and Ponce 2006; Csurka, Dance et al. 2006; Zhang, Marszalek et al. 2007). As shown in 
Figure 6.6, this algorithm simply computes the distribution (histogram) of visual words found 
in the query image and compares this distribution to those found in the training images. We 
will give more details of this approach in Section 7.1.4. The biggest difference from instance 
recognition is the absence of a geometric verification stage (Section 6.1), since individual 
instances of generic visual categories, such as those shown in Figure 6.4a, have relatively 
little spatial coherence to their features (but see the work by Lazebnik, Schmid, and Ponce 
(2006)). 
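
As a rough illustration of the comparison step described above, the following Python sketch quantizes descriptors to their nearest visual words, builds a normalized histogram, and labels a query image by histogram intersection with the training histograms. All function names are hypothetical, the vocabulary is assumed to have been learned beforehand (e.g., by k-means), and the nearest-histogram rule is a simplification standing in for a trained classifier such as an SVM.

```python
from collections import Counter
from math import dist


def quantize(descriptor, vocabulary):
    """Return the index of the visual word (cluster center) nearest to a descriptor."""
    return min(range(len(vocabulary)), key=lambda k: dist(descriptor, vocabulary[k]))


def bow_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word occurrences for one image."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    n = max(len(descriptors), 1)  # avoid dividing by zero for an empty image
    return [counts[k] / n for k in range(len(vocabulary))]


def histogram_intersection(h1, h2):
    """Similarity between two histograms (larger means more similar)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def classify(query_descriptors, training_set, vocabulary):
    """training_set: list of (label, histogram) pairs (hypothetical layout).

    Assumption: the query takes the label of the most similar training histogram.
    """
    q = bow_histogram(query_descriptors, vocabulary)
    label, _ = max(training_set, key=lambda item: histogram_intersection(q, item[1]))
    return label
```

Note that no keypoint positions are used anywhere, which is exactly the absence of geometric verification mentioned above.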

Csurka, Dance et al. (2004) were the first to use the term bag of keypoints to describe such
approaches and among the first to demonstrate the utility of frequency-based techniques for
category recognition. Their original system used affine covariant regions and SIFT descriptors,
k-means visual vocabulary construction, and both a naive Bayesian classifier and support
vector machines for classification. (The latter was found to perform better.) Their newer system
(Csurka, Dance et al. 2006) uses regular (non-affine) SIFT patches and boosting instead
of SVMs and incorporates a small amount of geometric consistency information.


Zhang, Marszalek et al. (2007) perform a more detailed study of such bag of features 
systems. They compare a number of feature detectors (Harris–Laplace (Mikolajczyk and
Schmid 2004) and Laplacian (Lindeberg 1998b)), descriptors (SIFT, RIFT, and SPIN (Lazeb- 
nik, Schmid, and Ponce 2005)), and SVM kernel functions. 


Instead of quantizing feature vectors to visual words, Grauman and Darrell (2007b) de- 
velop a technique for directly computing an approximate distance between two variably sized 
collections of feature vectors. Their approach is to bin the feature vectors into a multi- 
resolution pyramid defined in feature space and count the number of features that land in 
corresponding bins $B_{il}$ and $B'_{il}$. The distance between the two sets of feature vectors (which
can be thought of as points in a high-dimensional space) is computed using histogram inter- 
section between corresponding bins, while discounting matches already found at finer levels 
and weighting finer matches more heavily. In follow-on work, Grauman and Darrell (2007a)
show how an explicit construction of the pyramid can be avoided using hashing techniques.
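
The pyramid match idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: bins are uniform grid cells whose side length doubles at each level, matches first found at level i are weighted by 1/2^i, and the names `bin_counts` and `pyramid_match` are hypothetical.

```python
from collections import Counter
from math import floor


def bin_counts(points, side):
    """Histogram of points over a uniform grid with the given cell side length."""
    return Counter(tuple(floor(x / side) for x in p) for p in points)


def intersection(c1, c2):
    """Histogram intersection: number of features matched within shared bins."""
    return sum(min(n, c2[key]) for key, n in c1.items())


def pyramid_match(X, Y, num_levels=5, finest_side=1.0):
    """Approximate similarity between two variable-size sets of feature vectors."""
    score = 0.0
    previous = 0  # matches already counted at finer levels
    for level in range(num_levels):
        side = finest_side * 2 ** level
        matched = intersection(bin_counts(X, side), bin_counts(Y, side))
        # Only new matches count, and coarser (later) levels get lower weight.
        score += (matched - previous) / 2 ** level
        previous = matched
    return score
```

Because the grids are nested, the intersection count can only grow from one level to the next, so each term in the sum is non-negative.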


Inspired by this work, Lazebnik, Schmid, and Ponce (2006) show how a similar idea 
can be employed to augment bags of keypoints with loose notions of 2D spatial location 
analogous to the pooling performed by SIFT (Lowe 2004) and “gist” (Torralba, Murphy et 
al. 2003). In their work, they extract affine region descriptors (Lazebnik, Schmid, and Ponce 
2005) and quantize them into visual words. (Based on previous results by Fei-Fei and Perona 
(2005), the feature descriptors are extracted densely (on a regular grid) over the image, which 
can be helpful in describing textureless regions such as the sky.) They then form a spatial 
pyramid of bins containing word counts (histograms) and use a similar pyramid match kernel
to combine histogram intersection counts in a hierarchical fashion.
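
The spatial pyramid described above can be sketched as follows. The code assumes keypoint coordinates normalized to the unit square and uses the level weights commonly quoted for this scheme (1/2^L for the coarsest level, 1/2^(L-l+1) for level l >= 1), so that a plain histogram intersection of the concatenated vectors acts as the pyramid match kernel. Function names are hypothetical.

```python
def spatial_pyramid(points, words, num_words, num_levels=2):
    """Concatenate weighted visual-word histograms over a pyramid of image grids.

    points: (x, y) keypoint locations, assumed normalized to [0, 1)
    words:  visual-word index of each keypoint
    """
    L = num_levels
    feature = []
    for level in range(L + 1):
        cells = 2 ** level  # the image is split into cells x cells regions
        weight = 1 / 2 ** L if level == 0 else 1 / 2 ** (L - level + 1)
        hist = [0.0] * (cells * cells * num_words)
        for (x, y), w in zip(points, words):
            cx = min(int(x * cells), cells - 1)
            cy = min(int(y * cells), cells - 1)
            hist[(cy * cells + cx) * num_words + w] += weight
        feature.extend(hist)
    return feature


def intersection_kernel(f1, f2):
    """Histogram intersection between two pyramid feature vectors."""
    return sum(min(a, b) for a, b in zip(f1, f2))
```

With `num_levels=0` this reduces to an (unnormalized) bag-of-words histogram with no spatial information.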


The debate about whether to use quantized feature descriptors or continuous descriptors 
and also whether to use sparse or dense features went on for many years. Boiman, Shecht- 
man, and Irani (2008) show that if query images are compared to all the features represent- 
ing a given class, rather than just each class image individually, nearest-neighbor matching 
followed by a naive Bayes classifier outperforms quantized visual words. Instead of us- 
ing generic feature detectors and descriptors, some authors have been investigating learning 
class-specific features (Ferencz, Learned-Miller, and Malik 2008), often using randomized 
forests (Philbin, Chum et al. 2007; Moosmann, Nowak, and Jurie 2008; Shotton, Johnson, 


and Cipolla 2008) or combining the feature generation and image classification stages (Yang,

Figure 6.7 Using pictorial structures to locate and track a person (Felzenszwalb and Huttenlocher
2005) © 2005 Springer. The structure consists of articulated rectangular body parts
(torso, head, and limbs) connected in a tree topology that encodes relative part positions and 
orientations. To fit a pictorial structure model, a binary silhouette image is first computed 
using background subtraction. 


Jin et al. 2008). Others, such as Serre, Wolf, and Poggio (2005) and Mutch and Lowe (2008) 
use hierarchies of dense feature transforms inspired by biological (visual cortical) processing 
combined with SVMs for final classification.
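
The image-to-class comparison of Boiman, Shechtman, and Irani (2008) mentioned above can be illustrated with a short sketch. It assumes that the descriptors from all training images of a class are pooled into one list and that squared Euclidean nearest-neighbor distances are summed over the query descriptors; the function name and data layout are hypothetical, and a real system would use an approximate nearest-neighbor index rather than brute force.

```python
from math import dist


def nbnn_classify(query_descriptors, class_descriptors):
    """Pick the class whose pooled descriptors best explain the query descriptors.

    class_descriptors: dict label -> list of descriptors pooled over every
    training image of that class (no quantization into visual words).
    """
    def image_to_class_distance(label):
        pool = class_descriptors[label]
        # Each query descriptor contributes its squared distance to the
        # nearest descriptor anywhere in the class pool.
        return sum(min(dist(d, p) ** 2 for p in pool) for d in query_descriptors)

    return min(class_descriptors, key=image_to_class_distance)
```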


Part-based models 


Recognizing an object by finding its constituent parts and measuring their geometric relation- 
ships is one of the oldest approaches to object recognition (Fischler and Elschlager 1973; 
Kanade 1977; Yuille 1991). Part-based approaches were often used for face recognition 
(Moghaddam and Pentland 1997; Heisele, Ho et al. 2003; Heisele, Serre, and Poggio 2007) 
and continue being used for pedestrian detection (Figure 6.24) (Felzenszwalb, McAllester, 
and Ramanan 2008) and pose estimation (Güler, Neverova, and Kokkinos 2018). 

In this overview, we discuss some of the central issues in part-based recognition, namely, 
the representation of geometric relationships, the representation of individual parts, and al- 
gorithms for learning such descriptions and recognizing them at run time. More details on 
part-based models for recognition can be found in the course notes by Fergus (2009). 

The earliest approaches to representing geometric relationships were dubbed pictorial 
structures by Fischler and Elschlager (1973) and consisted of spring-like connections between 
different feature locations (Figure 6.1a). To fit a pictorial structure to an image, an energy 
function of the form 


$$E = \sum_i V_i(l_i) + \sum_{(i,j) \in E} V_{ij}(l_i, l_j) \qquad (6.1)$$

is minimized over all potential part locations or poses $\{l_i\}$ and pairs of parts $(i, j)$ for which
an edge (geometric relationship) exists in E. Note how this energy is closely related to that 
used with Markov random fields (4.35-4.38), which can be used to embed pictorial struc- 
tures in a probabilistic framework that makes parameter learning easier (Felzenszwalb and 
Huttenlocher 2005). 
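
To make the structure of this energy concrete, the sketch below evaluates it for one candidate assignment of part locations. The dictionary-based layout and the name `pictorial_energy` are assumptions made for illustration only.

```python
def pictorial_energy(locations, unary, pairwise):
    """Evaluate the pictorial structure energy (6.1) for one set of part locations.

    locations: dict part -> location l_i
    unary:     dict part -> function V_i(l_i)
    pairwise:  dict (i, j) -> function V_ij(l_i, l_j), one entry per edge in E
    """
    energy = sum(unary[i](l_i) for i, l_i in locations.items())
    energy += sum(v_ij(locations[i], locations[j])
                  for (i, j), v_ij in pairwise.items())
    return energy
```

For example, with a unary term that prefers the torso at position 5 and a spring that prefers the head 3 units above the torso, placing the torso at 5 and the head at 8 incurs only the head's unary penalty.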


Part-based models can have different topologies for the geometric connections between 
the parts (Carneiro and Lowe 2006). For example, Felzenszwalb and Huttenlocher (2005) 
restrict the connections to a tree, which makes learning and inference more tractable. A 
tree topology enables the use of a recursive Viterbi (dynamic programming) algorithm (Pearl 
1988; Bishop 2006), in which leaf nodes are first optimized as a function of their parents, and 
the resulting values are then plugged in and eliminated from the energy function. To further
increase the efficiency of the inference algorithm, Felzenszwalb and Huttenlocher (2005)
restrict the pairwise energy functions $V_{ij}(l_i, l_j)$ to be Mahalanobis distances on functions of
location variables and then use fast distance transform algorithms to minimize each pairwise
interaction in time that is closer to linear in $N$, the number of candidate locations for each part.
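
The leaf-to-root elimination described above can be sketched as a min-sum dynamic program over a tree. This version assumes that every part shares the same discrete set of N candidate locations (indexed 0..N-1) and uses a brute-force inner minimization, so it runs in O(P N^2) time for P parts; it does not include the distance transform speed-up. The function name and argument layout are hypothetical.

```python
def fit_tree(unary, children, pairwise, root=0):
    """Minimize a tree-structured energy by dynamic programming.

    unary[i][l]:              cost of placing part i at location index l
    children[i]:              list of the child parts of part i
    pairwise(i, j, li, lj):   cost between parent i at li and child j at lj
    Returns (minimum energy, dict part -> location index).
    """
    n_loc = len(unary[root])
    best_child_loc = {}  # (child, parent location) -> optimal child location

    def subtree_cost(i):
        # Cost of part i plus everything below it, as a function of l_i.
        cost = list(unary[i])
        for j in children.get(i, []):
            sub = subtree_cost(j)
            for lp in range(n_loc):
                lj = min(range(n_loc),
                         key=lambda l: sub[l] + pairwise(i, j, lp, l))
                best_child_loc[(j, lp)] = lj
                cost[lp] += sub[lj] + pairwise(i, j, lp, lj)
        return cost

    root_cost = subtree_cost(root)
    l_root = min(range(n_loc), key=root_cost.__getitem__)

    # Backtrack from the root to recover the optimal location of every part.
    assignment = {root: l_root}
    stack = [root]
    while stack:
        i = stack.pop()
        for j in children.get(i, []):
            assignment[j] = best_child_loc[(j, assignment[i])]
            stack.append(j)
    return root_cost[l_root], assignment
```

The inner `min` over child locations is the step that a distance transform would replace when the pairwise cost has the restricted quadratic form.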


Figure 6.7 shows the results of using their pictorial structures algorithm to fit an articu- 
lated body model to a binary image obtained by background segmentation. In this application 
of pictorial structures, parts are parameterized by the locations, sizes, and orientations of their 
approximating rectangles. Unary matching potentials $V_i(l_i)$ are determined by counting the
percentage of foreground and background pixels inside and just outside the tilted rectangle
representing each part.
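
A simplified version of such a unary term is sketched below. It makes two assumptions that depart from the description above: rectangles are axis-aligned rather than tilted, so that an integral image can be used, and the cost is taken to be the fraction of background pixels inside the rectangle plus the fraction of foreground pixels in a thin surrounding border. The exact form used by Felzenszwalb and Huttenlocher (2005) may differ, and all names are hypothetical.

```python
def integral_image(mask):
    """Summed-area table of a binary silhouette given as a list of rows."""
    h, w = len(mask), len(mask[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += mask[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, r0, c0, r1, c1):
    """Number of foreground pixels in rows r0..r1-1 and columns c0..c1-1."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]


def rectangle_cost(ii, r0, c0, r1, c1, margin=2):
    """Low when the box is filled with foreground and surrounded by background."""
    h, w = len(ii) - 1, len(ii[0]) - 1
    area_in = (r1 - r0) * (c1 - c0)
    fg_in = box_sum(ii, r0, c0, r1, c1)
    # Border region: the box grown by `margin` pixels, clipped to the image.
    R0, C0 = max(r0 - margin, 0), max(c0 - margin, 0)
    R1, C1 = min(r1 + margin, h), min(c1 + margin, w)
    area_ring = (R1 - R0) * (C1 - C0) - area_in
    fg_ring = box_sum(ii, R0, C0, R1, C1) - fg_in
    background_inside = 1 - fg_in / area_in
    foreground_outside = fg_ring / area_ring if area_ring else 0.0
    return background_inside + foreground_outside
```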


A large number of different graphical models have been proposed for part-based recogni- 
tion. Carneiro and Lowe (2006) discuss a number of these models and propose one of their 
own, which they call a sparse flexible model; it involves ordering the parts and having each
part’s location depend on at most k of its ancestor locations.


The simplest models are bags of words, where there are no geometric relationships be- 
tween different parts or features. While such models can be very efficient, they have a very 
limited capacity to express the spatial arrangement of parts. Trees and stars (a special case 
of trees where all leaf nodes are directly connected to a common root) are the most efficient 
in terms of inference and hence also learning (Felzenszwalb and Huttenlocher 2005; Fer- 
gus, Perona, and Zisserman 2005; Felzenszwalb, McAllester, and Ramanan 2008). Directed 
acyclic graphs come next in terms of complexity and can still support efficient inference, 
although at the cost of imposing a causal structure on the part model (Bouchard and Triggs 
2005; Carneiro and Lowe 2006). k-fans, in which a clique of size k forms the root of a star-shaped
model, have inference complexity $O(N^{k+1})$, although with distance transforms and
Gaussian priors, this can be lowered to $O(N^k)$ (Crandall, Felzenszwalb, and Huttenlocher

Figure 6.8 The importance of context (images courtesy of Antonio Torralba). Can you
name all of the objects in images (a–b), especially those that are circled in (c–d)? Look
carefully at the circled objects. Did you notice that they all have the same shape (after being
rotated), as shown in column (e)?


2005; Crandall and Huttenlocher 2006). Finally, fully connected constellation models are 
the most general, but the assignment of features to parts becomes intractable for moderate 
numbers of parts P, since the complexity of such an assignment is $O(N^P)$ (Fergus, Perona,
and Zisserman 2007). 
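
The distance transforms mentioned above as a way of lowering inference cost can be illustrated in one dimension. The sketch below computes D(q) = min_p ((q - p)^2 + f(p)) for all q in linear time using the lower envelope of parabolas, following the well-known construction of Felzenszwalb and Huttenlocher; it assumes finite costs sampled on a regular integer grid, and the function name is hypothetical.

```python
def distance_transform_1d(f):
    """Squared-distance transform of a sampled cost function f[0..n-1].

    Returns d with d[q] = min over p of (q - p)**2 + f[p], in O(n) time.
    Assumes all entries of f are finite.
    """
    n = len(f)
    if n == 0:
        return []
    v = [0] * n          # grid positions of the parabolas in the lower envelope
    z = [0.0] * (n + 1)  # boundaries between consecutive envelope parabolas
    k = 0
    z[0], z[1] = float("-inf"), float("inf")
    for q in range(1, n):
        while True:
            p = v[k]
            # Abscissa where the parabolas rooted at p and q intersect.
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2 * q - 2 * p)
            if s <= z[k]:
                k -= 1   # parabola p is hidden by parabola q; discard it
            else:
                break
        k += 1
        v[k], z[k], z[k + 1] = q, s, float("inf")
    d = [0.0] * n
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d
```

Replacing a brute-force minimization over all pairs of parent and child locations with this transform is what turns a quadratic pairwise step into a linear one when the pairwise cost is quadratic in the displacement.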

The original constellation model was developed by Burl, Weber, and Perona (1998) and 
consists of a number of parts whose relative positions are encoded by their mean locations 
and a full covariance matrix, which is used to denote not only positional uncertainty but also 
potential correlations between different parts. Weber, Welling, and Perona (2000) extended 
this technique to a weakly supervised setting, where both the appearance of each part and its 
locations are automatically learned given whole image labels. Fergus, Perona, and Zisserman 
(2007) further extend this approach to simultaneous learning of appearance and shape models 
from scale-invariant keypoint detections. 

The part-based approach to recognition has also been extended to learning new categories 
from small numbers of examples, building on recognition components developed for other 
classes (Fei-Fei, Fergus, and Perona 2006). More complex hierarchical part-based models can 
be developed using the concept of grammars (Bouchard and Triggs 2005; Zhu and Mumford 
2006). A simpler way to use parts is to have keypoints that are recognized as being part of a 
class vote for the estimated part locations (Leibe, Leonardis, and Schiele 2008). Parts can also 
be a useful component of fine-grained category recognition systems, as shown in Figure 6.9. 


Context and scene understanding 


Thus far, we have mostly considered the task of recognizing and localizing objects in isolation
from that of understanding the scene (context) in which the objects occur. This is a big




limitation, as context plays a very important role in human object recognition (Oliva and Tor- 
ralba 2007). Context can greatly improve the performance of object recognition algorithms 
(Divvala, Hoiem et al. 2009), as well as providing useful semantic clues for general scene 
understanding (Torralba 2008). 


Consider the two photographs in Figure 6.8a–b. Can you name all of the objects, especially
those circled in images (c–d)? Now have a closer look at the circled objects. Do you see
any similarity in their shapes? In fact, if you rotate them by 90°, they are all the same as the
“blob” shown in Figure 6.8e. So much for our ability to recognize objects by their shape!


Even though we have not addressed context explicitly earlier in this chapter, we have 
already seen several instances of this general idea being used. A simple way to incorporate 
spatial information into a recognition algorithm is to compute feature statistics over different 
regions, as in the spatial pyramid system of Lazebnik, Schmid, and Ponce (2006). Part-based 
models (Figure 6.7) use a kind of local context, where various parts need to be arranged in a
proper geometric relationship to constitute an object.


The biggest difference between part-based and context models is that the latter combine 
objects into scenes and the number of constituent objects from each class is not known in 
advance. In fact, it is possible to combine part-based and context models into the same recog- 
nition architecture (Murphy, Torralba, and Freeman 2003; Sudderth, Torralba et al. 2008; 
Crandall and Huttenlocher 2007). 


Consider an image database consisting of street and office scenes. If we have enough 
training images with labeled regions, such as buildings, cars, and roads, or monitors, key- 
boards, and mice, we can develop a geometric model for describing their relative positions. 
Sudderth, Torralba et al. (2008) develop such a model, which can be thought of as a two-level 
constellation model. At the top level, the distribution of objects relative to each other (say,
buildings with respect to cars) is modeled as a Gaussian. At the bottom level, the distribution
of parts (affine covariant features) with respect to the object center is modeled using a mix- 
ture of Gaussians. However, since the number of objects in the scene and parts in each object 
are unknown, a latent Dirichlet process (LDP) is used to model object and part creation in 
a generative framework. The distributions for all of the objects and parts are learned from a 
large labeled database and then later used during inference (recognition) to label the elements 
of a scene. 

Another example of context is in simultaneous segmentation and recognition (Section 6.4 
and Figure 6.33), where the arrangements of various objects in a scene are used as part of 
the labeling process. Torralba, Murphy, and Freeman (2004) describe a conditional random 
field where the estimated locations of buildings and roads influence the detection of cars, and
where boosting is used to learn the structure of the CRF. Rabinovich, Vedaldi et al. (2007)
use context to improve the results of CRF segmentation by noting that certain adjacencies
(relationships) are more likely than others, e.g., a person is more likely to be on a horse 
than on a dog. Galleguillos and Belongie (2010) review various approaches proposed for 
adding context to object categorization, while Yao and Fei-Fei (2012) study human-object 
interactions. (For a more recent take on this problem, see Gkioxari, Girshick et al. (2018).)

Context also plays an important role in 3D inference from single images (Figure 6.41),
using computer vision techniques for labeling pixels as belonging to the ground, vertical 
surfaces, or sky (Hoiem, Efros, and Hebert 2005a). This line of work has been extended to 
a more holistic approach that simultaneously reasons about object identity, location, surface 
orientations, occlusions, and camera viewing parameters (Hoiem, Efros, and Hebert 2008).

A number of approaches use the gist of a scene (Torralba 2003; Torralba, Murphy et al.
2003) to determine where instances of particular objects are likely to occur. For example, 
Murphy, Torralba, and Freeman (2003) train a regressor to predict the vertical locations of 
objects such as pedestrians, cars, and buildings (or screens and keyboards for indoor office 
scenes) based on the gist of an image. These location distributions are then used with classic 
object detectors to improve the performance of the detectors. Gists can also be used to directly 
match complete images, as we saw in the scene completion work of Hays and Efros (2007). 
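
The idea of using a global scene descriptor to predict where objects are likely to appear, and then re-weighting detector outputs accordingly, can be sketched as follows. This is only an illustration under loud assumptions: a k-nearest-neighbor average stands in for the trained regressor, the prior is a Gaussian on vertical position with a hand-picked width, and the function names and data layout are hypothetical rather than taken from Murphy, Torralba, and Freeman (2003).

```python
from math import dist, exp


def predict_vertical_location(gist, training, k=3):
    """Predict an object's expected vertical position from a gist vector.

    training: list of (gist_vector, vertical_location in [0, 1]) pairs.
    Assumption: k-NN averaging replaces the learned regressor.
    """
    nearest = sorted(training, key=lambda item: dist(gist, item[0]))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rescore(detections, expected_y, sigma=0.1):
    """Down-weight detections whose height disagrees with the scene prior.

    detections: list of (detector_score, vertical_location) pairs.
    """
    return [(score * exp(-0.5 * ((y - expected_y) / sigma) ** 2), y)
            for score, y in detections]
```

A detection with a high raw score but an implausible height (say, a car in the sky) ends up ranked below a weaker detection at the expected height.
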
Finally, some of the work in scene understanding exploits the existence of large numbers 
of labeled (or even unlabeled) images to perform matching directly against whole images, 
where the images themselves implicitly encode the expected relationships between objects 
(Russell, Torralba et al. 2007; Malisiewicz and Efros 2008; Galleguillos and Belongie 2010). 
This, of course, is one of the central benefits of using deep neural networks, which we discuss
in the next section.


6.2.2 Deep networks 


As we saw in Section 5.4.3, deep networks started outperforming “shallow” learning-based 
approaches on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with the 
introduction of the “AlexNet” SuperVision system of Krizhevsky, Sutskever, and Hinton 
(2012). Since that time, recognition accuracy has continued to improve dramatically (Fig- 
ure 5.40) driven to a large degree by deeper networks and better training algorithms. More 
recently, more efficient networks have become the focus of research (Figure 5.45) as well as 
larger (unlabeled) training datasets (Section 5.4.7). There are now open-source frameworks
such as Classy Vision³ for training and fine tuning your own image and video classification
models. Users can also upload custom images on the web to the Computer Vision Explorer⁴

³ https://classyvision.ai
⁴ https://vision-explorer.allenai.org

Figure 6.9 Fine-grained category recognition using parts (Zhang, Donahue et al. 2014) ©
2014 Springer. Deep neural network object and part detectors are trained and their outputs
are combined using geometric constraints. A classifier trained on features from the extracted
parts is used for the final categorization.


to see how well many popular computer vision models perform on their own images. 

In addition to recognizing commonly occurring categories such as those found in the Im- 
ageNet and COCO datasets, researchers have studied the problem of fine-grained category 
recognition (Duan, Parikh et al. 2012; Zhang, Donahue et al. 2014; Krause, Jin et al. 2015), 
where the differences between sub-categories can be subtle and the number of exemplars is 
quite low (Figure 6.9). Examples of categories with fine-grained sub-classes include flowers 
(Nilsback and Zisserman 2006), cats and dogs (Parkhi, Vedaldi et al. 2012), birds (Wah, Bran- 
son et al. 2011; Van Horn, Branson et al. 2015), and cars (Yang, Luo et al. 2015). A recent 
example of fine-grained categorization is the iNaturalist system (Van Horn, Mac Aodha et al. 
2018),⁵ which allows both specialists and citizen scientists to photograph and label biological
species, using a fine-grained category recognition system to label new images (Figure 6.10a). 

Fine-grained categorization is often attacked using attributes of images and classes (Lam- 
pert, Nickisch, and Harmeling 2009; Parikh and Grauman 2011; Lampert, Nickisch, and 
Harmeling 2014), as shown in Figure 6.10b. Extracting attributes can enable zero-shot learning
(Xian, Lampert et al. 2019), where previously unseen categories can be described using
combinations of such attributes. However, some caution must be used in order not to


⁵ https://www.inaturalist.org




[Figure 6.10 image panels: (a) two-spotted ladybug (Adalia bipunctata) and seven-spotted ladybug (Coccinella septempunctata); (b) attribute lists (black, white, brown, stripes, water, eats fish) for an otter and a polar bear.]


Figure 6.10 Fine-grained category recognition. (a) The iNaturalist website and app allow
citizen scientists to collect and classify images on their phones (Van Horn, Mac Aodha et al. 
2018) © 2018 IEEE. (b) Attributes can be used for fine-grained categorization and zero-shot 
learning (Lampert, Nickisch, and Harmeling 2014) © 2014 Springer. These images are part 
of the Animals with Attributes dataset. 


learn spurious correlations between different attributes (Jayaraman, Sha, and Grauman 2014) 
or between objects and their common contexts (Singh, Mahajan et al. 2020). Fine-grained 
recognition can also be tackled using metric learning (Wu, Manmatha et al. 2017) or nearest- 
neighbor visual similarity search (Touvron, Sablayrolles et al. 2020), which we discuss next. 
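
To illustrate how attributes can support zero-shot recognition, the sketch below scores each unseen class by how well its attribute signature agrees with per-attribute probabilities predicted for an image. It is a simplified stand-in for attribute-based classifiers, assuming independent attributes and ignoring attribute priors; the function name is hypothetical, and the example signatures in the usage note are made up for illustration rather than taken from the Animals with Attributes dataset.

```python
from math import log


def zero_shot_classify(attribute_probs, class_signatures, eps=1e-6):
    """Choose the class whose attribute signature best matches the predictions.

    attribute_probs:  dict attribute -> predicted P(attribute present | image)
    class_signatures: dict class -> dict attribute -> bool (known per class,
                      so no training images of that class are required)
    """
    def log_score(signature):
        total = 0.0
        for name, present in signature.items():
            p = min(max(attribute_probs[name], eps), 1 - eps)  # avoid log(0)
            total += log(p if present else 1 - p)
        return total

    return max(class_signatures, key=lambda c: log_score(class_signatures[c]))
```

For instance, with hypothetical signatures in which one class is brown and lives in water while another is white and lives in water, an image with a high predicted probability for "brown" and a low one for "white" is assigned to the first class, even if neither class appeared in the attribute classifiers' training data.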


6.2.3 Application: Visual similarity search 


Automatically classifying images into categories and tagging them with attributes using com- 
puter vision algorithms makes it easier to find them in catalogues and on the web. This is 
commonly used in image search or image retrieval engines, which find likely images based 
on keywords, just as regular web search engines find relevant documents and pages. 

Sometimes, however, it’s easier to find the information you need from an image, i.e., 
using visual search. Examples of this include fine-grained categorization, which we have 
just seen, as well as instance retrieval, i.e., finding the exact same object (Section 6.1) or 
location (Section 11.2.3). Another variant is finding visually similar images (often called 
visual similarity search or reverse image search), which is useful when the search intent 
cannot be succinctly captured in words.⁶


⁶ Some authors use the term image retrieval to denote visual similarity search (e.g., Jégou, Perronnin et al. 2012;




Figure 6.11 The GrokNet product recognition service is used for product tagging, visual 
search, and recommendations © Bell, Liu et al. (2020): (a) recognizing all the products in 
a photo; (b) automatically sourcing data for metric learning using weakly supervised data
augmentation.


The topic of searching by visual similarity has a long history and goes by a variety 
of names, including query by image content (QBIC) (Flickner, Sawhney et al. 1995) and 
content-based image retrieval (CBIR) (Smeulders, Worring et al. 2000; Lew, Sebe et al. 
2006; Vasconcelos 2007; Datta, Joshi et al. 2008). Early publications in these fields were 
based primarily on simple whole-image similarity metrics, such as color and texture (Swain 
and Ballard 1991; Jacobs, Finkelstein, and Salesin 1995; Manjunathi and Ma 1996). 

Later architectures, such as that by Fergus, Perona, and Zisserman (2004), use a feature- 
based learning and recognition algorithm to re-rank the outputs from a traditional keyword- 
based image search engine. In follow-on work, Fergus, Fei-Fei et al. (2005) cluster the results 
returned by image search using an extension of probabilistic latent semantic analysis (PLSA)
(Hofmann 1999) and then select the clusters associated with the highest ranked results as the 
representative images for that category. Other approaches rely on carefully annotated image 
databases such as LabelMe (Russell, Torralba et al. 2008). For example, Malisiewicz and 
Efros (2008) describe a system that, given a query image, can find similar LabelMe images, 
whereas Liu, Yuen, and Torralba (2009) combine feature-based correspondence algorithms 
with the labeled database to perform simultaneous recognition and segmentation. 

Newer approaches to visual similarity search use whole-image descriptors such as Fisher 
kernels and the Vector of Locally Aggregated Descriptors (VLAD) (Jégou, Perronnin et al. 
2012) or pooled CNN activations (Babenko and Lempitsky 2015a; Tolias, Sicre, and Jégou 
2016; Cao, Araujo, and Sim 2020; Ng, Balntas et al. 2020; Tolias, Jenicek, and Chum 2020) 
combined with metric learning (Bell and Bala 2015; Song, Xiang et al. 2016; Gordo, Al- 
mazán et al. 2017; Wu, Manmatha et al. 2017; Berman, Jégou et al. 2019) to represent each 
image with a compact descriptor that can be used to measure similarity in large databases 
(Johnson, Douze, and Jégou 2021). It is also possible to combine several techniques, such 
as deep networks with VLAD (Arandjelovic, Gronat et al. 2016), generalized mean (GeM) 


Radenović, Tolias, and Chum 2019).

Figure 6.12 The GrokNet training architecture uses seven datasets, a common DNN trunk,
two branches, and 83 loss functions (80 categorical losses + 3 embedding losses) © Bell, Liu 
et al. (2020). 


pooling (Radenović, Tolias, and Chum 2019), or dynamic mean (DAME) pooling (Yang,
Kien Nguyen et al. 2019) into complete systems that are end-to-end tunable. Gordo, Almazán
et al. (2017) provide a comprehensive review and experimental comparison of many
of these techniques, which we also discuss in Section 7.1.4 on large-scale matching and re- 
trieval. Some of the latest techniques for image retrieval use combinations of local and global 
descriptors to obtain state-of-the-art performance on landmark recognition tasks (Cao,
Araujo, and Sim 2020; Ng, Balntas et al. 2020; Tolias, Jenicek, and Chum 2020). The ECCV
2020 Workshop on Instance-Level Recognition⁷ has pointers to some of the latest work in
this area, while the upcoming NeurIPS’21 Image Similarity Challenge⁸ has new datasets for
detecting content manipulation. 
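
As a small illustration of pooled-activation descriptors, the sketch below applies generalized mean (GeM) pooling to a set of per-location channel activations and then ranks database descriptors by cosine similarity. It assumes non-negative activations (e.g., after a ReLU) and a fixed pooling exponent p, whereas in the systems cited above p is typically learned; the brute-force `search` function is a hypothetical stand-in for a large-scale similarity index.

```python
from math import sqrt


def gem_pool(activations, p=3.0):
    """Generalized mean pooling over spatial locations, one value per channel.

    activations: list of per-location channel vectors with non-negative entries.
    p = 1 gives average pooling; large p approaches max pooling.
    """
    n = len(activations)
    channels = len(activations[0])
    return [(sum(a[c] ** p for a in activations) / n) ** (1 / p)
            for c in range(channels)]


def l2_normalize(v):
    norm = sqrt(sum(x * x for x in v)) or 1.0  # leave all-zero vectors unchanged
    return [x / norm for x in v]


def search(query, database, top_k=3):
    """Indices of the top_k database descriptors by cosine similarity."""
    q = l2_normalize(query)
    scored = [(sum(a * b for a, b in zip(q, l2_normalize(d))), idx)
              for idx, d in enumerate(database)]
    return [idx for _, idx in sorted(scored, reverse=True)[:top_k]]
```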

A recent example of a commercial system that uses visual similarity search, in addition to 
category recognition, is the GrokNet product recognition service described by Bell, Liu et al. 
(2020). GrokNet takes as input user images and shopping queries and returns indexed items 
similar to the ones in the query image (Figure 6.11a). The reason for needing a similarity 
search component is that the world contains too many “long-tail” items such as “a fur sink, an 
electric dog polisher, or a gasoline powered turtleneck sweater”,⁹ to make full categorization
practical. 

At training time, GrokNet takes both weakly labeled images, with category and/or at- 
tribute labels, and unlabeled images, where features in objects are detected and then used for 


metric learning, using a modification of ArcFace loss (Deng, Guo et al. 2019) and a novel


⁷ https://ilr-workshop.github.io/ECCVW2020
⁸ https://www.drivendata.org/competitions/79/
⁹ https://www.google.com/search?q=gasoline+powered+turtleneck+sweater

Figure 6.13 Humans can recognize low-resolution faces of familiar people (Sinha, Balas
et al. 2006) © 2006 IEEE.


pairwise margin loss (Figure 6.11b). The overall system takes in large collections of un- 
labeled and weakly labeled images and trains a ResNeXt101 trunk using a combination of 
category and attribute softmax losses and three different metric losses on the embeddings 
(Figure 6.12). GrokNet is just one example of a large number of commercial visual prod- 
uct search systems that have recently been developed. Others include systems from Amazon 
(Wu, Manmatha et al. 2017), Pinterest (Zhai, Wu et al. 2019), and Facebook (Tang, Borisyuk 
et al. 2019). In addition to helping people find items they may wish to purchase, large-scale
similarity search can also speed the search for harmful content on the web, as exemplified in
Facebook’s SimSearchNet.¹⁰
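
For readers unfamiliar with angular margin losses, the sketch below computes the standard ArcFace loss for a single embedding: cosine similarities to per-class weight vectors are scaled, and an additive angular margin is applied to the target class before the softmax cross-entropy. This is the generic formulation, not GrokNet's modified version, and it ignores the special handling usually applied when the angle plus the margin exceeds π. The function name and default values are assumptions.

```python
from math import acos, cos, exp, log, sqrt


def arcface_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Additive angular margin softmax loss for one example.

    embedding:     feature vector produced by the network trunk
    class_weights: one weight vector per class (learned during training)
    target:        index of the ground-truth class
    """
    def unit(v):
        n = sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    e = unit(embedding)
    logits = []
    for k, w in enumerate(class_weights):
        c = sum(a * b for a, b in zip(e, unit(w)))
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if k == target:
            c = cos(acos(c) + margin)  # make the target class harder to satisfy
        logits.append(scale * c)
    # Numerically stable log-sum-exp for the softmax normalizer.
    m = max(logits)
    log_z = m + log(sum(exp(l - m) for l in logits))
    return log_z - logits[target]
```

The margin forces embeddings of the same class to cluster more tightly in angle than a plain softmax would require, which is what makes the resulting embedding useful for nearest-neighbor similarity search.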


6.2.4 Face recognition 


Among the various recognition tasks that computers are asked to perform, face recognition 
is the one where they have arguably had the most success.¹¹ While even people cannot readily
distinguish between similar people with whom they are not familiar (O’Toole, Jiang et
al. 2006; O’Toole, Phillips et al. 2009), computers’ ability to distinguish among a small
number of family members and friends has found its way into consumer-level photo applications.
Face recognition can be used in a variety of additional applications, including human–computer
interaction (HCI), identity verification (Kirovski, Jojic, and Jancke 2004), desktop
login, parental controls, and patient monitoring (Zhao, Chellappa et al. 2003), but it also has 
the potential for misuse (Chokshi 2019; Ovide 2020). 


Face recognizers work best when they are given images of faces under a wide variety of 


¹⁰ https://ai.facebook.com/blog/using-ai-to-detect-covid-19-misinformation-and-exploitative-content

¹¹ Instance recognition, i.e., the re-recognition of known objects such as locations or planar objects, is the other
most successful application of general image recognition. In the general domain of biometrics, i.e., identity recognition,
specialized images such as irises and fingerprints perform even better (Jain, Bolle, and Pankanti 1999; Daugman
2004).

Name/URL                                                  Contents/Reference

CMU Multi-PIE database                                    337 people’s faces in various poses
http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie         Gross, Matthews et al. (2010)

Faces in the Wild                                         5,749 internet celebrities
http://vis-www.cs.umass.edu/lfw                           Huang, Ramesh et al. (2007)

YouTube Faces (YTF)                                       1,595 people in 3,425 YouTube videos
https://www.cs.tau.ac.il/~wolf/ytfaces                    Wolf, Hassner, and Maoz (2011)

MegaFace                                                  1M internet faces
https://megaface.cs.washington.edu                        Nech and Kemelmacher-Shlizerman (2017)

IARPA Janus Benchmark (IJB)                               31,334 faces of 3,531 people in videos
https://www.nist.gov/programs-projects/face-challenges    Maze, Adams et al. (2018)

WIDER FACE                                                32,203 images for face detection
http://shuoyang1213.me/WIDERFACE                          Yang, Luo et al. (2016)


Table 6.1 Face recognition and detection datasets, adapted from Maze, Adams et al. 
(2018). 


pose, illumination, and expression (PIE) conditions (Phillips, Moon et al. 2000; Sim, Baker, 
and Bsat 2003; Gross, Shi, and Cohn 2005; Huang, Ramesh et al. 2007; Phillips, Scruggs 
et al. 2010). More recent widely used datasets include Labeled Faces in the Wild (LFW)
(Huang, Ramesh et al. 2007; Learned-Miller, Huang et al. 2016), YouTube Faces (YTF)
(Wolf, Hassner, and Maoz 2011), MegaFace (Kemelmacher-Shlizerman, Seitz et al. 2016;
Nech and Kemelmacher-Shlizerman 2017), and the IARPA Janus Benchmark (IJB) (Klare,
Klein et al. 2015; Maze, Adams et al. 2018), as tabulated in Table 6.1. (See Masi, Wu et al. 
(2018) for additional datasets used for training.) 

Some of the earliest approaches to face recognition involved finding the locations of 
distinctive image features, such as the eyes, nose, and mouth, and measuring the distances 
between these feature locations (Fischler and Elschlager 1973; Kanade 1977; Yuille 1991). 
Other approaches relied on comparing gray-level images projected onto lower dimensional 
subspaces called eigenfaces (Section 5.2.3) and jointly modeling shape and appearance vari- 
ations (while discounting pose variations) using active appearance models (Section 6.2.4). 
Descriptions of “classic” (pre-DNN) face recognition systems can be found in a number of 
surveys and books on this topic (Chellappa, Wilson, and Sirohey 1995; Zhao, Chellappa et al. 
2003; Li and Jain 2005) as well as the Face Recognition website.¹² The survey on face recognition
by humans by Sinha, Balas et al. (2006) is also well worth reading; it includes a number
of surprising results, such as humans’ ability to recognize low-resolution images of familiar


¹² https://www.face-rec.org




Figure 6.14 Manipulating facial appearance through shape and color (Rowland and Per- 
rett 1995) © 1995 IEEE. By adding or subtracting gender-specific shape and color charac- 
teristics to an input image (b), different amounts of gender variation can be induced. The 
amounts added (from the mean) are: (a) +50% (gender enhancement), (b) 0% (original image),
(c) -50% (near “androgyny”), (d) -100% (gender switched), and (e) -150% (opposite
gender attributes enhanced).


faces (Figure 6.13) and the importance of eyebrows in recognition. Researchers have also 
studied the automatic recognition of facial expressions. See Chang, Hu et al. (2006), Shan,
Gong, and McOwan (2009), and Li and Deng (2020) for some representative papers. 


Active appearance and 3D shape models 


The need to use modular or view-based eigenspaces for face recognition, which we discussed 
in Section 5.2.3, is symptomatic of a more general observation, i.e., that facial appearance 
and identifiability depend as much on shape as they do on color or texture (which is what 
eigenfaces capture). Furthermore, when dealing with 3D head rotations, the pose of a person’s 
head should be discounted when performing recognition. 

In fact, the earliest face recognition systems, such as those by Fischler and Elschlager 
(1973), Kanade (1977), and Yuille (1991), found distinctive feature points on facial images 
and performed recognition on the basis of their relative positions or distances. Later tech- 
niques such as local feature analysis (Penev and Atick 1996) and elastic bunch graph match- 
ing (Wiskott, Fellous et al. 1997) combined local filter responses (jets) at distinctive feature 
locations together with shape models to perform recognition. 

A visually compelling example of why both shape and texture are important is the work
of Rowland and Perrett (1995), who manually traced the contours of facial features and then
used these contours to normalize (warp) each image to a canonical shape. After analyzing 
both the shape and color images for deviations from the mean, they were able to associate 
certain shape and color deformations with personal characteristics such as age and gender 
(Figure 6.14). Their work demonstrates that both shape and color have an important influence 
on the perception of such characteristics. 

Around the same time, researchers in computer vision were beginning to use simultane- 
ous shape deformations and texture interpolation to model the variability in facial appearance 
caused by identity or expression (Beymer 1996; Vetter and Poggio 1997), developing tech- 
niques such as Active Shape Models (Lanitis, Taylor, and Cootes 1997), 3D Morphable Mod- 
els (Blanz and Vetter 1999; Egger, Smith et al. 2020), and Elastic Bunch Graph Matching 
(Wiskott, Fellous et al. 1997).¹³

The active appearance models (AAMs) of Cootes, Edwards, and Taylor (2001) model
both the variation in the shape of an image $s$, which is normally encoded by the location of
key feature points on the image, and the variation in texture $t$, which is normalized to a
canonical shape before being analyzed. Both shape and texture are represented as deviations
from a mean shape $\bar{s}$ and texture $\bar{t}$,

$$ s = \bar{s} + U_s a, \tag{6.2} $$
$$ t = \bar{t} + U_t a, \tag{6.3} $$

where the eigenvectors in $U_s$ and $U_t$ have been pre-scaled (whitened) so that unit vectors in
a represent one standard deviation of variation observed in the training data. In addition to 
these principal deformations, the shape parameters are transformed by a global similarity to 
match the location, size, and orientation of a given face. Similarly, the texture image contains 
a scale and offset to best match novel illumination conditions. 
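
As a minimal sketch of this linear model, the following Python fragment synthesizes a shape and a texture from one shared parameter vector. The tiny means and bases are made-up illustrative numbers, not values from Cootes, Edwards, and Taylor (2001), and the function names are hypothetical. The global similarity transform and the illumination scale and offset are omitted.

```python
def mat_vec(U, a):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(u * x for u, x in zip(row, a)) for row in U]

def synthesize(mean_s, U_s, mean_t, U_t, a):
    """Shape and texture as mean plus deviations driven by the same vector a."""
    s = [m + d for m, d in zip(mean_s, mat_vec(U_s, a))]
    t = [m + d for m, d in zip(mean_t, mat_vec(U_t, a))]
    return s, t

# Hypothetical toy model: 2 landmarks (x0, y0, x1, y1), 3 texture samples, 2 modes.
mean_s = [10.0, 20.0, 30.0, 20.0]
U_s = [[1.0, 0.0], [0.0, 0.5], [-1.0, 0.0], [0.0, 0.5]]
mean_t = [0.5, 0.5, 0.5]
U_t = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.0]]

# The bases are assumed to be pre-scaled (whitened), so a = (3, 0) means
# "three standard deviations along the first mode".
s, t = synthesize(mean_s, U_s, mean_t, U_t, [3.0, 0.0])
```

Because a single vector drives both outputs, moving along one mode changes the landmark positions and the texture samples together, which is the coupling the text describes.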

As you can see, the same appearance parameters a in (6.2-6.3) simultaneously control 
both the shape and texture deformations from the mean, which makes sense if we believe 
them to be correlated. Figure 6.15 shows how moving three standard deviations along each 
of the first four principal directions ends up changing several correlated factors in a person’s 
appearance, including expression, gender, age, and identity. 

Although active appearance models are primarily designed to accurately capture the variability
in appearance and deformation that are characteristic of faces, they can be adapted
to face recognition by computing an identity subspace that separates variation in identity
from other sources of variability such as lighting, pose, and expression (Costen, Cootes et al.
1999). The basic idea, which is modeled after similar work in eigenfaces (Belhumeur, Hespanha,
and Kriegman 1997; Moghaddam, Jebara, and Pentland 2000), is to compute separate
statistics for intrapersonal and extrapersonal variation and then find discriminating directions
in these subspaces. While AAMs have sometimes been used directly for recognition (Blanz
and Vetter 2003), their main use in the context of recognition is to align faces into a canonical
pose (Liang, Xiao et al. 2008; Ren, Cao et al. 2014) so that more traditional methods of
face recognition (Penev and Atick 1996; Wiskott, Fellous et al. 1997; Ahonen, Hadid, and
Pietikäinen 2006; Zhao and Pietikäinen 2007; Cao, Yin et al. 2010) can be used.

¹³We will look at the application of PCA to 3D head and face modeling and animation in Section 13.6.3.

Figure 6.15 Principal modes of variation in active appearance models (Cootes, Edwards,
and Taylor 2001) © 2001 IEEE. The four images show the effects of simultaneously changing
the first four modes of variation in both shape and texture by $\pm\sigma$ from the mean. You can
clearly see how the shape of the face and the shading are simultaneously affected.


Active appearance models have been extended to deal with illumination and viewpoint 
variation (Gross, Baker et al. 2005) as well as occlusions (Gross, Matthews, and Baker 2006). 
One of the most significant extensions is to construct 3D models of shape (Matthews, Xiao, 
and Baker 2007), which are much better at capturing and explaining the full variability of 
facial appearance across wide changes in pose. Such models can be constructed either from 
monocular video sequences (Matthews, Xiao, and Baker 2007), as shown in Figure 6.16a,
or from multi-view video sequences (Ramnath, Koterba et al. 2008), which provide even 
greater reliability and accuracy in reconstruction and tracking (Murphy-Chutorian and Trivedi 
2009).

Figure 6.16 Head tracking and frontalization: (a) using 3D active appearance models
(AAMs) (Matthews, Xiao, and Baker 2007) © 2007 Springer, showing video frames along
with the estimated yaw, pitch, and roll parameters and the fitted 3D deformable mesh; (b)
using six and then 67 fiducial points in the DeepFace system (Taigman, Yang et al. 2014) ©
2014 IEEE, used to frontalize the face image (bottom row).

Figure 6.17 The DeepFace architecture (Taigman, Yang et al. 2014) © 2014 IEEE, starts
with a frontalization stage, followed by several locally connected (non-convolutional) layers,
and then two fully connected layers with a K-class softmax.


Facial recognition using deep learning


Prompted by the dramatic success of deep networks in whole-image categorization, face 
recognition researchers started using deep neural network backbones as part of their sys-
tems. Figures 6.16b–6.17 show two stages in the DeepFace system of Taigman, Yang et al.
(2014), which was one of the first systems to realize large gains using deep networks. In their 
system, a landmark-based pre-processing frontalization step is used to convert the original 
color image into a well-cropped front-looking face. Then, a deep locally connected network 
(where the convolution kernels can vary spatially) is fed into two final fully connected layers 
before classification. 

Some of the more recent deep face recognizers omit the frontalization stage and instead 
use data augmentation (Section 5.3.3) to create synthetic inputs with a larger variety of poses 
(Schroff, Kalenichenko, and Philbin 2015; Parkhi, Vedaldi, and Zisserman 2015). Masi, Wu 
et al. (2018) provide an excellent tutorial and survey on deep face recognition, including a list 
of widely used training and testing datasets, a discussion of frontalization and dataset aug- 
mentation, and a section on training losses (Figure 6.18). This last topic is central to the ability 
to scale to larger and larger numbers of people. Schroff, Kalenichenko, and Philbin (2015) 
and Parkhi, Vedaldi, and Zisserman (2015) use triplet losses to construct a low-dimensional 
embedding space that is independent of the number of subjects. More recent systems use 
contrastive losses inspired by the softmax function, which we discussed in Section 5.3.4. For 
example, the ArcFace paper by Deng, Guo et al. (2019) measures angular distances on the 
unit hypersphere in the embedding space and adds an extra margin to get identities to clump 
together. This idea has been further extended for visual similarity search (Bell, Liu et al. 
2020) and face recognition (Huang, Shen et al. 2020; Deng, Guo et al. 2020a). 
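
The two families of losses mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the code of any of the cited systems: the function names are hypothetical, the default margins and scale are illustrative assumptions, and the handling of angles that exceed π after adding the margin is omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that it lies on the hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances: the positive must be closer to the
    anchor than the negative by at least `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def angular_margin_logits(embedding, class_centers, label, margin=0.5, scale=64.0):
    """Cosine logits with an additive angular margin on the true class."""
    e = normalize(embedding)
    logits = []
    for k, w in enumerate(class_centers):
        cos = sum(a * b for a, b in zip(e, normalize(w)))
        theta = math.acos(max(-1.0, min(1.0, cos)))
        if k == label:
            theta += margin   # the true class must win by an angular margin
        logits.append(scale * math.cos(theta))
    return logits

def cross_entropy(logits, label):
    """Softmax cross-entropy, computed stably with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]
```

The triplet loss never refers to a class index, which is why the resulting embedding does not depend on the number of subjects. The margin-based variant keeps a softmax over identities during training but penalizes embeddings that are not well inside their own class's angular neighborhood.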


Personal photo collections 


In addition to digital cameras automatically finding faces to aid in auto-focusing and video 
cameras finding faces in video conferencing to center on the speaker (either mechanically 
or digitally), face detection has found its way into most consumer-level photo organization 
packages and photo sharing sites. Finding faces and allowing users to tag them makes it easier 
to find photos of selected people at a later date or to automatically share them with friends. 
In fact, the ability to tag friends in photos is one of the more popular features on Facebook. 
Sometimes, however, faces can be hard to find and recognize, especially if they are small, 
turned away from the camera, or otherwise occluded. In such cases, combining face recognition
with person detection and clothes recognition can be very effective, as illustrated in
Figure 6.19 (Sivic, Zitnick, and Szeliski 2006). Combining person recognition with other
kinds of context, such as location recognition (Section 11.2.3) or activity or event recognition,
can also help boost performance (Lin, Kapoor et al. 2010).

Figure 6.18 A typical modern deep face recognition architecture, from the survey by Masi,
Wu et al. (2018) © 2018 IEEE. At training time, a huge labeled face set (a) is used to constrain
the weights of a DCNN (b), optimizing a loss function (c) for a classification task. At test time,
the classification layer is often discarded, and the DCNN is used as a feature extractor for
comparing face descriptors.


6.3 Object detection 


If we are given an image to analyze, such as the group portrait in Figure 6.20, we could try to 
apply a recognition algorithm to every possible sub-window in this image. Such algorithms 
are likely to be both slow and error-prone. Instead, it is more effective to construct special- 
purpose detectors, whose job it is to rapidly find likely regions where particular objects might 
occur. 

We begin this section with face detectors, which were some of the earliest successful 
examples of recognition. Such algorithms are built into most of today’s digital cameras to 
enhance auto-focus and into video conferencing systems to control panning and zooming. We 
then look at pedestrian detectors, as an example of more general methods for object detection. 
Finally, we turn to the problem of multi-class object detection, which today is solved using
deep neural networks.

Figure 6.19 Person detection and re-recognition using a combined face, hair, and torso
model (Sivic, Zitnick, and Szeliski 2006) © 2006 Springer. (a) Using face detection alone,
several of the heads are missed. (b) The combined face and clothing model successfully
re-finds all the people.


6.3.1 Face detection 


Before face recognition can be applied to a general image, the locations and sizes of any faces 
must first be found (Figures 6.1c and 6.20). In principle, we could apply a face recognition 
algorithm at every pixel and scale (Moghaddam and Pentland 1997) but such a process would 
be too slow in practice. 


Over the last four decades, a wide variety of fast face detection algorithms have been 
developed. Yang, Kriegman, and Ahuja (2002) and Zhao, Chellappa et al. (2003) provide 
comprehensive surveys of earlier work in this field. According to their taxonomy, face de- 
tection techniques can be classified as feature-based, template-based, or appearance-based. 
Feature-based techniques attempt to find the locations of distinctive image features such as 
the eyes, nose, and mouth, and then verify whether these features are in a plausible geometri- 
cal arrangement. These techniques include some of the early approaches to face recognition 
(Fischler and Elschlager 1973; Kanade 1977; Yuille 1991), as well as later approaches based 
on modular eigenspaces (Moghaddam and Pentland 1997), local filter jets (Leung, Burl, and
Perona 1995; Penev and Atick 1996; Wiskott, Fellous et al. 1997), support vector machines
(Heisele, Ho et al. 2003; Heisele, Serre, and Poggio 2007), and boosting (Schneiderman and
Kanade 2004).

Figure 6.20 Face detection results produced by Rowley, Baluja, and Kanade (1998) ©
1998 IEEE. Can you find the one false positive (a box around a non-face) among the 57 true
positive results?

Template-based approaches, such as active appearance models (AAMs) (Section 6.2.4), 
can deal with a wide range of pose and expression variability. Typically, they require good 
initialization near a real face and are therefore not suitable as fast face detectors. 

Appearance-based approaches scan over small overlapping rectangular patches of the im- 
age searching for likely face candidates, which can then be refined using a cascade of more 
expensive but selective detection algorithms (Sung and Poggio 1998; Rowley, Baluja, and 
Kanade 1998; Romdhani, Torr et al. 2001; Fleuret and Geman 2001; Viola and Jones 2004). 
To deal with scale variation, the image is usually converted into a sub-octave pyramid and a 
separate scan is performed on each level. Most appearance-based approaches rely heavily on 
training classifiers using sets of labeled face and non-face patches. 
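
The scanning strategy can be sketched as follows. This is only an illustration of how candidate windows are enumerated over a sub-octave pyramid; the 20-pixel patch matches the patch size quoted later for Rowley, Baluja, and Kanade (1998), while the stride, the number of levels per octave, and the function names are assumptions made for the example.

```python
def pyramid_sizes(width, height, patch=20, levels_per_octave=2):
    """Yield (scale, w, h) for each level of a sub-octave pyramid, stopping
    once a level becomes smaller than the detector patch."""
    level = 0
    while True:
        scale = 2.0 ** (level / levels_per_octave)   # shrink factor of this level
        w, h = int(width / scale), int(height / scale)
        if w < patch or h < patch:
            break
        yield scale, w, h
        level += 1

def sliding_windows(width, height, patch=20, stride=4, levels_per_octave=2):
    """Yield candidate boxes (x, y, size) in original-image coordinates."""
    for scale, w, h in pyramid_sizes(width, height, patch, levels_per_octave):
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                yield x * scale, y * scale, patch * scale

# Number of patches a classifier would have to score on a 640 x 480 image.
num_candidates = sum(1 for _ in sliding_windows(640, 480))
```

Every candidate produced this way must be scored by the classifier, which is why the cost of the per-patch test dominates the running time of appearance-based detectors.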

Sung and Poggio (1998) and Rowley, Baluja, and Kanade (1998) present two of the ear- 
liest appearance-based face detectors and introduce a number of innovations that are widely 
used in later work by others. To start with, both systems collect a set of labeled face patches 
(Figure 6.20) as well as a set of patches taken from images that are known not to contain 
faces, such as aerial images or vegetation. The collected face images are augmented by arti- 
ficially mirroring, rotating, scaling, and translating the images by small amounts to make the 
face detectors less sensitive to such effects. 


The next few paragraphs provide quick reviews of a number of early appearance-based
face detectors, keyed by the machine learning algorithms they are based on. These systems
provide an interesting glimpse into the gradual adoption and evolution of machine learning
in computer vision. More detailed descriptions can be found in the original papers, as well as
the first edition of this book (Szeliski 2010).

Figure 6.21 A neural network for face detection (Rowley, Baluja, and Kanade 1998) ©
1998 IEEE. Overlapping patches are extracted from different levels of a pyramid and then
pre-processed. A three-layer neural network is then used to detect likely face locations.


Clustering and PCA. Once the face and non-face patterns have been pre-processed, Sung 
and Poggio (1998) cluster each of these datasets into six separate clusters using k-means and 
then fit PCA subspaces to each of the resulting 12 clusters. At detection time, the DIFS and 
DFFS metrics first developed by Moghaddam and Pentland (1997) are used to produce 24 
Mahalanobis distance measurements (two per cluster). The resulting 24 measurements are
input to a multi-layer perceptron (MLP), i.e., a fully connected neural network.
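
A rough sketch of how the two distances per cluster could be computed is given below. It assumes the common formulation in which DIFS is the Mahalanobis distance inside the principal subspace and DFFS is the squared residual outside it; the exact normalizations used by Sung and Poggio (1998) may differ, and the function names and data layout are hypothetical.

```python
def difs_dffs(x, mean, eigvecs, eigvals):
    """Return (DIFS, DFFS) of x for one PCA cluster.

    eigvecs: list of orthonormal principal directions (each a list of floats);
    eigvals: the variances along those directions."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    # Coordinates of x inside the principal subspace.
    y = [sum(di * vi for di, vi in zip(d, v)) for v in eigvecs]
    # DIFS: Mahalanobis distance measured within the subspace.
    difs = sum(yi * yi / lam for yi, lam in zip(y, eigvals))
    # DFFS: squared residual left over after projecting onto the subspace.
    dffs = sum(di * di for di in d) - sum(yi * yi for yi in y)
    return difs, max(dffs, 0.0)

def cluster_features(x, clusters):
    """Concatenate both distances for every cluster (12 clusters give 24 values)."""
    feats = []
    for mean, eigvecs, eigvals in clusters:
        feats.extend(difs_dffs(x, mean, eigvecs, eigvals))
    return feats
```

The resulting feature vector is what a downstream classifier such as the MLP would receive in place of the raw pixels.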


Neural networks. Instead of first clustering the data and computing Mahalanobis distances 
to the cluster centers, Rowley, Baluja, and Kanade (1998) apply a neural network (MLP) 
directly to the 20 x 20 pixel patches of gray-level intensities, using a variety of differently 
sized hand-crafted “receptive fields” to capture both large-scale and smaller scale structure 
(Figure 6.21). The resulting neural network directly outputs the likelihood of a face at the 
center of every overlapping patch in a multi-resolution pyramid. Since several overlapping 
patches (in both space and resolution) may fire near a face, an additional merging network is 
used to merge overlapping detections. The authors also experiment with training several networks
and merging their outputs. Figure 6.20 shows a sample result from their face detector.


Support vector machines. Instead of using a neural network to classify patches, Osuna,
Freund, and Girosi (1997) use support vector machines (SVMs), which we discussed in Sec- 
tion 5.1.4, to classify the same preprocessed patches as Sung and Poggio (1998). An SVM 
searches for a series of maximum margin separating planes in feature space between different 
classes (in this case, face and non-face patches). In those cases where linear classification 
boundaries are insufficient, the feature space can be lifted into higher-dimensional features 
using kernels (5.29). SVMs have been used by other researchers for both face detection and 
face recognition (Heisele, Ho et al. 2003; Heisele, Serre, and Poggio 2007) as well as general 
object recognition (Lampert 2008). 


Boosting. Of all the face detectors developed in the 2000s, the one introduced by Viola 
and Jones (2004) is probably the best known. Their technique was the first to introduce the 
concept of boosting to the computer vision community, which involves training a series of 
increasingly discriminating simple classifiers and then blending their outputs (Bishop 2006, 
Section 14.3; Hastie, Tibshirani, and Friedman 2009, Chapter 10; Murphy 2012, Section 16.4; 
Glassner 2018, Section 14.7). 

In more detail, boosting involves constructing a classifier $h(x)$ as a sum of simple weak
learners,

$$ h(x) = \operatorname{sign}\left[ \sum_j \alpha_j h_j(x) \right], \tag{6.4} $$

where each of the weak learners $h_j(x)$ is an extremely simple function of the input, and hence
is not expected to contribute much (in isolation) to the classification performance.


In most variants of boosting, the weak learners are threshold functions,

$$ h_j(x) = a_j [f_j < \theta_j] + b_j [f_j \geq \theta_j] =
   \begin{cases} a_j & \text{if } f_j < \theta_j \\ b_j & \text{otherwise,} \end{cases} \tag{6.5} $$

which are also known as decision stumps (basically, the simplest possible version of decision
trees). In most cases, it is also traditional (and simpler) to set $a_j$ and $b_j$ to $\pm 1$, i.e.,
$a_j = -s_j$, $b_j = +s_j$, so that only the feature $f_j$, the threshold value $\theta_j$, and the polarity
of the threshold $s_j \in \pm 1$ need to be selected.¹⁴
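
The weighted vote of decision stumps described above can be written down directly. The sketch below assumes the ±1 convention for the stump outputs, uses hypothetical function names, and arbitrarily maps a vote of exactly zero to +1.

```python
def stump(f_value, theta, s):
    """Decision stump: -s below the threshold, +s at or above it."""
    return -s if f_value < theta else s

def strong_classify(x, learners):
    """Sign of the weighted vote; each learner is (alpha, feature_fn, theta, s)."""
    total = sum(alpha * stump(feature(x), theta, s)
                for alpha, feature, theta, s in learners)
    return 1 if total >= 0 else -1

# Two illustrative learners that threshold individual input coordinates.
learners = [
    (0.8, lambda x: x[0], 0.5, +1),
    (0.3, lambda x: x[1], 2.0, -1),
]
label = strong_classify([0.7, 3.0], learners)
```

Only three quantities describe each stump (the feature, the threshold, and the polarity), so the search for the best weak learner at each boosting round stays cheap.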


¹⁴Some variants, such as that of Viola and Jones (2004), use $(a_j, b_j) \in [0, 1]$ and adjust the
learning algorithm accordingly.

Figure 6.22 Simple features used in boosting-based face detector (Viola and Jones 2004)
© 2004 Springer: (a) difference of rectangle feature composed of 2–4 different rectangles
(pixels inside the white rectangles are subtracted from the gray ones); (b) the first and second
features selected by AdaBoost. The first feature measures the differences in intensity between
the eyes and the cheeks, the second one between the eyes and the bridge of the nose.

In many applications of boosting, the features are simply coordinate axes $x_k$, i.e., the
boosting algorithm selects one of the input vector components as the best one to threshold. In
Viola and Jones’ face detector, the features are differences of rectangular regions in the input
patch, as shown in Figure 6.22. The advantage of using these features is that, while they are
more discriminating than single pixels, they are extremely fast to compute once a summed
area table has been precomputed, as described in Section 3.2.3 (3.31-3.32). Essentially, for
the cost of an $O(N)$ precomputation phase (where $N$ is the number of pixels in the image),
subsequent differences of rectangles can be computed in $4r$ additions or subtractions, where
$r \in \{2, 3, 4\}$ is the number of rectangles in the feature.
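
The constant-time rectangle sums can be illustrated with a small summed area table. This is a generic sketch rather than the implementation of Viola and Jones (2004); the function names and the particular two-rectangle layout are assumptions chosen for the example.

```python
def summed_area_table(img):
    """Build a table S with one extra row and column of zeros, so that
    S[y][x] is the sum of all pixels above and to the left of (x, y)."""
    h, w = len(img), len(img[0])
    S = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            S[y + 1][x + 1] = S[y][x + 1] + row_sum
    return S

def rect_sum(S, x0, y0, x1, y1):
    """Sum over pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]

def two_rect_feature(S, x, y, w, h):
    """Left half minus right half of a window that is 2w wide and h tall."""
    left = rect_sum(S, x, y, x + w, y + h)
    right = rect_sum(S, x + w, y, x + 2 * w, y + h)
    return left - right

img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
S = summed_area_table(img)
feature = two_rect_feature(S, 0, 0, 2, 2)
```

The table is built once per image in a single pass, after which the cost of a feature no longer depends on the size of its rectangles.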

The key to the success of boosting is the method for incrementally selecting the weak 
learners and for re-weighting the training examples after each stage. The AdaBoost (Adaptive 
Boosting) algorithm (Bishop 2006; Hastie, Tibshirani, and Friedman 2009; Murphy 2012) 
does this by re-weighting each sample as a function of whether it is correctly classified at each 
stage, and using the stage-wise average classification error to determine the final weightings 
$\alpha_j$ among the weak classifiers.
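
A compact sketch of discrete AdaBoost with axis-aligned decision stumps is shown below. It follows the textbook form of the algorithm rather than the exact variant used by Viola and Jones (2004), searches thresholds exhaustively (which is only practical for toy data), and uses hypothetical function names.

```python
import math

def train_adaboost(X, y, rounds):
    """Discrete AdaBoost. X: list of feature vectors, y: labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                 # start with uniform sample weights
    learners = []
    for _ in range(rounds):
        best = None
        # Try every feature, threshold, and polarity; keep the lowest weighted error.
        for k in range(len(X[0])):
            for theta in sorted({x[k] for x in X}):
                for s in (+1, -1):
                    pred = [s if x[k] >= theta else -s for x in X]
                    err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, k, theta, s, pred)
        err, k, theta, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # weight of this weak learner
        # Raise the weights of misclassified samples and lower the others.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
        learners.append((alpha, k, theta, s))
    return learners

def predict(learners, x):
    """Sign of the weighted vote of all selected stumps."""
    vote = sum(a * (s if x[k] >= theta else -s) for a, k, theta, s in learners)
    return 1 if vote >= 0 else -1

# Toy 1D problem that no single threshold can solve.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
learners = train_adaboost(X, y, rounds=3)
```

On this toy problem a single stump must misclassify at least two samples, while the re-weighting steers later rounds toward exactly those samples.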

To further increase the speed of the detector, it is possible to create a cascade of classifiers, 
where each classifier uses a small number of tests (say, a two-term AdaBoost classifier) to 
reject a large fraction of non-faces while trying to pass through all potential face candidates 
(Fleuret and Geman 2001; Viola and Jones 2004; Brubaker, Wu et al. 2008). 
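
The control flow of such a cascade is simple enough to sketch directly. The two stage functions below are deliberately trivial stand-ins (a contrast test and a brightness test) chosen only to make the example runnable; in a real detector each stage would be a small boosted classifier, and all names here are hypothetical.

```python
def cascade_classify(patch, stages):
    """Run stages in order and reject as soon as one score falls below its
    threshold. Returns (is_candidate, number_of_stages_evaluated)."""
    for i, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(patch) < threshold:
            return False, i        # early rejection: later stages never run
    return True, len(stages)       # survived every stage

# Illustrative stages, ordered from cheapest to most selective.
stages = [
    (lambda p: max(p) - min(p), 10),    # enough contrast?
    (lambda p: sum(p) / len(p), 50),    # bright enough on average?
]
flat_patch_result = cascade_classify([5, 5, 5, 5], stages)
```

Since the vast majority of patches in an image are non-faces, most of them exit after the first stage or two, and the average cost per patch stays close to the cost of the cheapest stage.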


Deep networks. Since the initial burst of face detection research in the early 2000s, face
detection algorithms have continued to evolve and improve (Zafeiriou, Zhang, and Zhang 2015).
Researchers have proposed using cascades of features (Li and Zhang 2013), deformable parts
models (Mathias, Benenson et al. 2014), aggregated channel features (Yang, Yan et al. 2014),
and neural networks (Li, Lin et al. 2015; Yang, Luo et al. 2015). The WIDER FACE
benchmark¹⁵,¹⁶ (Yang, Luo et al. 2016) contains results from, and pointers to, more recent papers,
including RetinaFace (Deng, Guo et al. 2020b), which combines ideas from other recent
neural networks and object detectors such as Feature Pyramid Networks (Lin, Dollár et al.
2017) and RetinaNet (Lin, Goyal et al. 2017), and also has a nice review of other recent face
detectors.

Figure 6.23 Pedestrian detection using histograms of oriented gradients (Dalal and Triggs
2005) © 2005 IEEE: (a) the average gradient image over the training examples; (b) each
“pixel” shows the maximum positive SVM weight in the block centered on the pixel; (c) likewise,
for the negative SVM weights; (d) a test image; (e) the computed R-HOG (rectangular
histogram of gradients) descriptor; (f) the R-HOG descriptor weighted by the positive SVM
weights; (g) the R-HOG descriptor weighted by the negative SVM weights.


6.3.2 Pedestrian detection 


While a lot of the early research on object detection focused on faces, the detection of other 
objects, such as pedestrians and cars, has also received widespread attention (Gavrila and 
Philomin 1999; Gavrila 1999; Papageorgiou and Poggio 2000; Mohan, Papageorgiou, and 
Poggio 2001; Schneiderman and Kanade 2004). Some of these techniques maintained the 
same focus as face detection on speed and efficiency. Others, however, focused on accuracy, 
viewing detection as a more challenging variant of generic class recognition (Section 6.3.3) 
in which the locations and extents of objects are to be determined as accurately as possible 
(Everingham, Van Gool et al. 2010; Everingham, Eslami et al. 2015; Lin, Maire et al. 2014). 

¹⁵http://shuoyang1213.me/WIDERFACE
¹⁶The WIDER FACE benchmark has expanded to a larger set of detection challenges and workshops:
https://wider-challenge.org/2019.html.

An example of a well-known pedestrian detector is the algorithm developed by Dalal
and Triggs (2005), who use a set of overlapping histogram of oriented gradients (HOG)
descriptors fed into a support vector machine (Figure 6.23). Each HOG has cells to accumulate
magnitude-weighted votes for gradients at particular orientations, just as in the scale
invariant feature transform (SIFT) developed by Lowe (2004), which we will describe in
Section 7.1.2 and Figure 7.16. Unlike SIFT, however, which is only evaluated at interest point
locations, HOGs are evaluated on a regular overlapping grid and their descriptor magnitudes
are normalized using an even coarser grid; they are only computed at a single scale and a
fixed orientation. To capture the subtle variations in orientation around a person’s outline, a
large number of orientation bins are used and no smoothing is performed in the central
difference gradient computation; see Dalal and Triggs (2005) for more implementation details.
Figure 6.23d shows a sample input image, while Figure 6.23e shows the associated HOG
descriptors.
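
The core of a HOG cell can be sketched as follows. This is a simplified illustration, not the reference implementation of Dalal and Triggs (2005): it uses hard binning instead of interpolated votes, ignores the border pixels of the cell, and the choice of nine unsigned orientation bins and the function names are assumptions of the example.

```python
import math

def cell_histogram(cell, bins=9):
    """Orientation histogram for one cell of gray-level pixels.

    Gradients come from unsmoothed central differences; each interior pixel
    casts a vote, weighted by gradient magnitude, into an unsigned-orientation bin."""
    hist = [0.0] * bins
    h, w = len(cell), len(cell[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0   # fold into [0, 180)
            hist[int(angle / (180.0 / bins)) % bins] += mag
    return hist

def l2_normalize(block, eps=1e-6):
    """Normalize the concatenated histograms of a block of neighboring cells."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]

# A vertical intensity edge: all gradient energy lands in the first bin.
vertical_edge = [[0, 0, 10, 10]] * 4
hist = cell_histogram(vertical_edge)
```

Normalizing over a block of several cells, rather than per cell, is what makes the descriptor robust to local changes in contrast.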


Once the descriptors have been computed, a support vector machine (SVM) is trained
on the resulting high-dimensional continuous descriptor vectors. Figures 6.23b–c show a
diagram of the (most) positive and negative SVM weights in each block, while Figures 6.23f–g
show the corresponding weighted HOG responses for the central input image. As you can
see, there are a fair number of positive responses around the head, torso, and feet of the
person, and relatively few negative responses (mainly around the middle and the neck of the
sweater).


Much like face detection, the fields of pedestrian and general object detection continued 
to advance rapidly in the 2000s (Belongie, Malik, and Puzicha 2002; Mikolajczyk, Schmid, 
and Zisserman 2004; Dalal and Triggs 2005; Leibe, Seemann, and Schiele 2005; Opelt, Pinz, 
and Zisserman 2006; Torralba 2007; Andriluka, Roth, and Schiele 2009; Maji and Berg 2009; 
Andriluka, Roth, and Schiele 2010; Dollár, Belongie, and Perona 2010).


A significant advance in the field of person detection was the work of Felzenszwalb, 
McAllester, and Ramanan (2008), who extend the histogram of oriented gradients person 
detector to incorporate flexible parts models (Section 6.2.1). Each part is trained and detected 
on HOGs evaluated at two pyramid levels below the overall object model and the locations 
of the parts relative to the parent node (the overall bounding box) are also learned and used 
during recognition (Figure 6.24b). To compensate for inaccuracies or inconsistencies in the 
training example bounding boxes (dashed white lines in Figure 6.24c), the “true” location of 
the parent (blue) bounding box is considered a latent (hidden) variable and is inferred during 
both training and recognition. Since the locations of the parts are also latent, the system 
can be trained in a semi-supervised fashion, without needing part labels in the training data. 
An extension to this system (Felzenszwalb, Girshick et al. 2010), which includes among its 
improvements a simple contextual model, was among the two best object detection systems 
in the 2008 Visual Object Classes detection challenge (Everingham, Van Gool et al. 2010).
Improvements to part-based person detection and pose estimation include work by Andriluka,
Roth, and Schiele (2009) and Kumar, Zisserman, and Torr (2009).

Figure 6.24 Part-based object detection (Felzenszwalb, McAllester, and Ramanan 2008)
© 2008 IEEE: (a) An input photograph and its associated person (blue) and part (yellow)
detection results. (b) The detection model is defined by a coarse template, several higher
resolution part templates, and a spatial model for the location of each part. (c) True positive
detection of a skier and (d) false positive detection of a cow (labeled as a person).

Figure 6.25 Pose detection using random forests (Rogez, Rihan et al. 2008) © 2008 IEEE.
The estimated pose (state of the kinematic model) is drawn over each input frame.

An even more accurate estimate of a person’s pose and location is presented by Rogez, 
Rihan et al. (2008), who compute both the phase of a person in a walk cycle and the locations 
of individual joints, using random forests built on top of HOGs (Figure 6.25). Since their 
system produces full 3D pose information, it is closer in its application domain to 3D person 
trackers (Sidenbladh, Black, and Fleet 2000; Andriluka, Roth, and Schiele 2010), which we
will discuss in Section 13.6.4. When video sequences are available, the additional information
present in the optical flow and motion discontinuities can greatly aid in the detection
task, as discussed by Efros, Berg et al. (2003), Viola, Jones, and Snow (2003), and Dalal,
Triggs, and Schmid (2006). 

Since the 2000s, pedestrian and general person detection have continued to be actively 
developed, often in the context of more general multi-class object detection (Everingham, 
Van Gool et al. 2010; Everingham, Eslami et al. 2015; Lin, Maire et al. 2014). The Caltech
pedestrian detection benchmark¹⁷ and survey by Dollár, Belongie, and Perona (2010)
introduces a new dataset and provides a nice review of algorithms through 2012, including
Integral Channel Features (Dollár, Tu et al. 2009), the Fastest Pedestrian Detector in the West
(Dollár, Belongie, and Perona 2010), and 3D pose estimation algorithms such as Poselets
(Bourdev and Malik 2009). Since its original construction, this benchmark has continued to
tabulate and evaluate more recent detectors, including Dollár, Appel, and Kienzle (2012), Dollár,
Appel et al. (2014), and more recent algorithms based on deep neural networks (Sermanet,
Kavukcuoglu et al. 2013; Ouyang and Wang 2013; Tian, Luo et al. 2015; Zhang, Lin et al.
2016). The CityPersons dataset (Zhang, Benenson, and Schiele 2017) and WIDER Face and
Person Challenge¹⁸ also report results on recent algorithms.


6.3.3 General object detection 


While face and pedestrian detection algorithms were the earliest to be extensively studied, 
computer vision has always been interested in solving the general object detection and label- 
ing problem, in addition to whole-image classification. The PASCAL Visual Object Classes 
(VOC) Challenge (Everingham, Van Gool et al. 2010), which contained 20 classes, had both 
classification and detection challenges. Early entries that did well on the detection challenge 
include a feature-based detector and spatial pyramid matching SVM classifier by Chum and 
Zisserman (2007), a star-topology deformable part model by Felzenszwalb, McAllester, and 
Ramanan (2008), and a sliding window SVM classifier by Lampert, Blaschko, and Hofmann 
(2008). The competition was re-run annually, with the two top entries in the 2012 detection 
challenge (Everingham, Eslami et al. 2015) using a sliding window spatial pyramid matching 
(SPM) SVM (de Sande, Uijlings et al. 2011) and a University of Oxford re-implementation 
of a deformable parts model (Felzenszwalb, Girshick et al. 2010). 

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), released in 2010, 
scaled up the dataset from around 20 thousand images in PASCAL VOC 2010 to over 1.4 
million in ILSVRC 2010, and from 20 object classes to 1,000 object classes (Russakovsky, 
Deng et al. 2015). Like PASCAL, it also had an object detection task, but it contained a much 
wider range of challenging images (Figure 6.4). The Microsoft COCO (Common Objects 
in Context) dataset (Lin, Maire et al. 2014) contained even more objects per image, as well
as pixel-accurate segmentations of multiple objects, enabling the study of not only semantic 
segmentation (Section 6.4), but also individual object instance segmentation (Section 6.4.2). 
Table 6.2 lists some of the datasets used for training and testing general object detection
algorithms.

¹⁷http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
¹⁸https://wider-challenge.org/2019.html

Figure 6.26 Intersection over union (IoU): (a) schematic formula, (b) real-world example
© 2020 Ross Girshick.


The release of COCO coincided with a wholesale shift to deep networks for image classification,
object detection, and segmentation (Jiao, Zhang et al. 2019; Zhao, Zheng et al.
2019). Figure 6.29 shows the rapid improvements in average precision (AP) on the COCO
object detection task, which correlate strongly with advances in deep neural network
architectures (Figure 5.40).


Precision vs. recall 


Before we describe the elements of modern object detectors, we should first discuss what 
metrics they are trying to optimize. The main task in object detection, as illustrated in Fig- 
ures 6.5a and 6.26b, is to put accurate bounding boxes around all the objects of interest and 
to correctly label such objects. To measure the accuracy of each bounding box (not too small 
and not too big), the common metric is intersection over union (IoU), which is also known as 
the Jaccard index or Jaccard similarity coefficient (Rezatofighi, Tsoi et al. 2019). The IoU is 
computed by taking the predicted and ground truth bounding boxes $B_{pr}$ and $B_{gt}$ for an object 
and computing the ratio of their area of intersection and their area of union, 

$$\text{IoU} = \frac{|B_{pr} \cap B_{gt}|}{|B_{pr} \cup B_{gt}|}, \tag{6.6}$$

as shown in Figure 6.26a. 
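
As a concrete illustration of equation (6.6), the following Python sketch computes the IoU of two axis-aligned boxes. The (x_min, y_min, x_max, y_max) box layout and the function name `iou` are assumptions made for this example rather than a convention taken from any particular benchmark.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    # Area of the union = sum of the two areas minus the doubly counted overlap.
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


b_pr = (10, 10, 50, 50)  # hypothetical predicted box
b_gt = (20, 20, 60, 60)  # hypothetical ground truth box
print(iou(b_pr, b_gt))
```

A box compared with itself yields an IoU of 1, and two disjoint boxes yield 0, so the measure penalizes predictions that are too small and too large alike.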

As we will shortly see, object detectors operate by first proposing a number of plausible 
rectangular regions (detections) and then classifying each detection while also producing a 
confidence score (Figure 6.26b). These regions are then run through some kind of non- 
maximal suppression (NMS) stage, which removes weaker detections that have too much 
overlap with stronger detections, using a greedy most-confident-first algorithm. 
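
The greedy most-confident-first procedure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any specific detector: the (score, box) tuple layout, the helper names, and the 0.5 overlap threshold are all assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximal_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs for a single class.

    Detections are visited from most to least confident; a detection is kept
    only if it does not overlap an already kept detection by more than
    iou_threshold (an assumed, tunable value).
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # heavily overlaps the first box, so it is suppressed
    (0.7, (20, 20, 30, 30)),  # far from the others, so it is kept
]
print(non_maximal_suppression(detections))
```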

To evaluate the performance of an object detector, we run through all of the detections, 
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Figure 6.27 Object detector average precision © 2020 Ross Girshick: (a) a precision- 
recall curve for a single class and IoU threshold, with the AP being the area under the P-R 


curve; (b) average precision averaged over several IoU thresholds (from looser to tighter). 


from most confident to least, and classify them as true positive TP (correct label and suffi- 
ciently high IoU) or false positive FP (incorrect label or ground truth object already matched). 
For each new decreasing confidence threshold, we can compute the precision and recall as 


$$\text{precision} = \frac{TP}{TP + FP} \tag{6.7}$$
$$\text{recall} = \frac{TP}{P}, \tag{6.8}$$


where P is the number of positive examples, i.e., the number of labeled ground truth detections 
in the test image.¹⁹ (See Section 7.1.3 on feature matching for additional terms that are often 
used in measuring and describing error rates.) 

Computing the precision and recall at every confidence threshold allows us to populate a 
precision-recall curve, such as the one in Figure 6.27a. The area under this curve is called 
average precision (AP). A separate AP score can be computed for each class being detected, and 
the results averaged to produce a mean average precision (mAP). While earlier benchmarks such 
as PASCAL VOC determined the mAP using a single IoU threshold of 0.5 (Everingham, Eslami et al. 
2015), the COCO benchmark (Lin, Maire et al. 2014) averages the mAP over a set of IoU 
thresholds, IoU ∈ {0.50, 0.55, ..., 0.95}, as shown in Figure 6.27b. While this AP score 
continues to be widely used, an alternative probability-based detection quality (PDQ) score has 
recently been proposed (Hall, Dayoub et al. 2020). A smoother version of average precision 
called Smooth-AP has also been proposed and shown to have benefits on large-scale image 
retrieval tasks (Brown, Xie et al. 2020). 
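
The evaluation procedure described above can be sketched as follows for a single class. The sketch assumes boxes stored as (x_min, y_min, x_max, y_max) tuples and computes the area under the curve after making the precision monotonically non-increasing, which is one common interpolation convention; actual benchmarks differ in such details, so treat the function names and conventions here as illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_threshold=0.5):
    """AP for one class; detections is a list of (score, box) pairs."""
    if not gt_boxes:
        return 0.0
    matched = [False] * len(gt_boxes)
    tp = fp = 0
    precisions, recalls = [], []
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda j: overlaps[j])
        if overlaps[best] >= iou_threshold and not matched[best]:
            matched[best] = True   # true positive: sufficient IoU, object not yet claimed
            tp += 1
        else:
            fp += 1                # false positive: poor overlap or object already matched
        precisions.append(tp / (tp + fp))   # equation (6.7)
        recalls.append(tp / len(gt_boxes))  # equation (6.8)
    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 29))]
print("AP at IoU 0.5:", average_precision(dets, gt, 0.5))
# Averaging over IoU thresholds 0.50, 0.55, ..., 0.95, in the style of COCO.
thresholds = [0.5 + 0.05 * i for i in range(10)]
print("AP averaged over thresholds:",
      sum(average_precision(dets, gt, t) for t in thresholds) / len(thresholds))
```

Repeating this computation for every class and averaging the results gives the mAP.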


19 Another widely reported measure is the F-score, which is the harmonic mean of the precision and recall. 
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Figure 6.28 The R-CNN and Fast R-CNN object detectors. (a) R-CNN rescales pixels 
inside each proposal region and performs a CNN + SVM classification (Girshick, Donahue 
et al. 2015) © 2015 IEEE. (b) Fast R-CNN resamples convolutional features and uses fully 
connected layers to perform classification and bounding box regression (Girshick 2015) © 
2015 IEEE. 


Modern object detectors 


The first stage in detecting objects in an image is to propose a set of plausible rectangular 
regions in which to run a classifier. The development of such region proposal algorithms was 
an active research area in the early 2010s (Alexe, Deselaers, and Ferrari 2012; Uijlings, Van 
De Sande et al. 2013; Cheng, Zhang et al. 2014; Zitnick and Dollár 2014). 

One of the earliest object detectors based on neural networks is R-CNN, the Region- 
based Convolutional Network developed by Girshick, Donahue et al. (2014). As illustrated 
in Figure 6.28a, this detector starts by extracting about 2,000 region proposals using the 
selective search algorithm of Uijlings, Van De Sande et al. (2013). Each proposed region is 
then rescaled (warped) to a 224 × 224 image and passed through an AlexNet or VGG neural 
network with a support vector machine (SVM) final classifier. 

The follow-on Fast R-CNN paper by Girshick (2015) interchanges the convolutional 
neural network and region extraction stages and replaces the SVM with some fully con- 
nected (FC) layers, which compute both an object class and a bounding box refinement (Fig- 
ure 6.28b). This reuses the CNN computations and leads to much faster training and test 
times, as well as dramatically better accuracy compared to previous networks (Figure 6.29). 
As you can see from Figure 6.28b, Fast R-CNN is an example of a deep network with a 
shared backbone and two separate heads, and hence two different loss functions, although 
these terms were not introduced until the Mask R-CNN paper by He, Gkioxari et al. (2017). 


The Faster R-CNN system, introduced a few months later by Ren, He et al. (2015), replaces 
the relatively slow selective search stage with a convolutional region proposal network (RPN), 
resulting in much faster inference. After computing convolutional features, the RPN suggests 
at each coarse location a number of potential anchor boxes, which vary in shape and size 


Figure 6.29 Best average precision (AP) results by year on the COCO object detection 
task (Lin, Maire et al. 2014) © 2020 Ross Girshick. The bars, from earliest to latest, are 
DPM (pre-DL), Fast R-CNN (AlexNet), Fast R-CNN (VGG-16), Faster R-CNN (VGG-16), 
Faster R-CNN (ResNet-50), Faster R-CNN (R-101-FPN), and Mask R-CNN (X-152-FPN), with 
progress within deep learning methods exceeding 3×. 


to accommodate different potential objects. Each proposal is then classified and refined by 
an instance of the Fast R-CNN heads and the final detections are ranked and merged using 
non-maximal suppression. 
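
To make the notion of anchor boxes more concrete, the sketch below enumerates anchors over a coarse feature grid. The stride of 16 pixels, the three scales, and the three aspect ratios are values commonly quoted for Faster R-CNN, but they should be read as assumptions of this example, as should the function name `make_anchors`.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x_min, y_min, x_max, y_max) in image coordinates.

    One set of len(scales) * len(ratios) anchors is centered on every cell of
    a feat_h x feat_w feature map whose cells are `stride` pixels apart.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio is height / width; the anchor area stays scale * scale.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_h=2, feat_w=3)
print(len(anchors), "anchors; first anchor:", anchors[0])
```

The RPN then scores each anchor for objectness and regresses an offset that refines its position and size, so the anchors only need to cover the space of plausible boxes coarsely.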

R-CNN, Fast R-CNN, and Faster R-CNN all operate on a single resolution convolutional 
feature map (Figure 6.30b). To obtain better scale invariance, it would be preferable to operate 
on a range of resolutions, e.g., by computing a feature map at each image pyramid level, as 
shown in Figure 6.30a, but this is computationally expensive. We could, instead, simply 
start with the various levels inside the convolutional network (Figure 6.30c), but these levels 
have different degrees of semantic abstraction, i.e., higher/smaller levels are attuned to more 
abstract constructs. The best solution is to construct a Feature Pyramid Network (FPN), as 
shown in Figure 6.30d, where top-down connections are used to endow higher-resolution 
(lower) pyramid levels with the semantics inferred at higher levels (Lin, Dollár et al. 2017).²⁰ 
This additional information significantly enhances the performance of object detectors (and 
other downstream tasks) and makes their behavior much less sensitive to object size. 

DETR (Carion, Massa et al. 2020) uses a simpler architecture that eliminates the use of 
non-maximum suppression and anchor generation. Their model consists of a ResNet back- 
bone that feeds into a transformer encoder-decoder. At a high level, it makes N bounding 
box predictions, some of which may include the “no object class”. The ground truth bounding 
boxes are also padded with “no object class” bounding boxes to obtain N total bounding 
boxes. During training, bipartite matching is then used to build a one-to-one mapping from 
every predicted bounding box to a ground truth bounding box, with the chosen mapping lead- 


20 It’s interesting to note that the human visual system is full of such re-entrant or feedback pathways (Gilbert and 
Li 2013), although the extent to which cognition influences perception is still being debated (Firestone and Scholl 
2016). 
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Figure 6.30 A Feature Pyramid Network and its precursors (Lin, Dollár et al. 2017) © 
2017 IEEE: (a) deep features extracted at each level in an image pyramid; (b) a single 
low-resolution feature map; (c) a deep feature pyramid, with higher levels having greater 
abstraction; (d) a Feature Pyramid Network, with top-down context for all levels. 


ing to the lowest possible cost. The overall training loss is then the sum of the losses between 
the matched bounding boxes. They find that their approach is competitive with state-of-the- 
art object detection performance on COCO. 
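
The one-to-one matching step can be illustrated with a toy example. The sketch below takes a square cost matrix (predictions by padded ground truth boxes) and finds the lowest-cost assignment by brute force over all permutations, which is only feasible for very small N; practical systems use the Hungarian algorithm instead. The cost values and the function name are hypothetical, and how each cost entry is computed from class scores and box distances is left out.

```python
from itertools import permutations


def match_predictions(cost):
    """Lowest-cost one-to-one assignment for a square N x N cost matrix.

    cost[i][j] is the cost of matching prediction i to ground truth j, where
    the ground truth list has been padded with "no object" entries.
    Returns (assignment, total_cost) with assignment[i] = matched index j.
    Brute force, O(N!): for illustration only.
    """
    n = len(cost)
    best_assignment, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_assignment, best_cost = list(perm), total
    return best_assignment, best_cost


# Hypothetical costs for N = 3 predictions (rows) and ground truth slots (columns).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.5],
    [0.7, 0.6, 0.3],
]
assignment, total = match_predictions(cost)
print(assignment, total)
```

The training loss is then accumulated only over the matched pairs, which is what removes the need for non-maximal suppression: each ground truth box is explained by exactly one prediction.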


Single-stage networks 


In the architectures we’ve looked at so far, a region proposal algorithm or network selects 
the locations and shapes of the detections to be considered, and a second network is then 
used to classify and regress the pixels or features inside each region. An alternative is to use 
a single-stage network, which uses a single neural network to output detections at a variety 
of locations. Two examples of such detectors are SSD (Single Shot MultiBox Detector) 
from Liu, Anguelov et al. (2016) and the family of YOLO (You Only Look Once) detectors 
described in Redmon, Divvala et al. (2016), Redmon and Farhadi (2017), and Redmon and 
Farhadi (2018). RetinaNet (Lin, Goyal et al. 2017) is also a single-stage detector built on 
top of a feature pyramid network. It uses a focal loss to focus the training on hard examples 
by downweighting the loss on well-classified samples, thus preventing the larger number of 
easy negatives from overwhelming the training. These and more recent convolutional object 
detectors are described in the recent survey by Jiao, Zhang et al. (2019). Figure 6.31 shows 
the speed and accuracy of detectors published up through early 2017. 
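
The effect of the focal loss used by RetinaNet can be seen in a small numerical sketch. It assumes the binary form FL(p_t) = -α_t (1 - p_t)^γ log(p_t) from Lin, Goyal et al. (2017), with γ = 2 and α = 0.25 as default values; the function name and the clamping constant are choices made for this example.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample.

    p is the predicted probability of the positive class and y is the label
    (1 or 0). With gamma = 0 the modulating factor disappears and the loss
    reduces to an alpha-weighted cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # well-classified positive: heavily downweighted
hard = focal_loss(0.1, 1)   # misclassified positive: loss stays large
print(easy, hard, hard / easy)
```

Because the factor (1 - p_t)^γ shrinks rapidly as p_t approaches 1, the many easy negatives contribute very little to the total loss, while hard examples retain most of their cross-entropy weight.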
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Figure 6.31 Speed/accuracy trade-offs for convolutional object detectors: (a) (Huang, 
Rathod et al. 2017) © 2017 IEEE; (b) YOLOv4 © Bochkovskiy, Wang, and Liao (2020). 


The latest in the family of YOLO detectors is YOLOv4 by Bochkovskiy, Wang, and Liao 
(2020). In addition to outperforming other recent fast detectors such as EfficientDet (Tan, 
Pang, and Le 2020), as shown in Figure 6.31b, the paper breaks the processing pipeline into 
several stages, including a neck, which performs the top-down feature enhancement found 
in the feature pyramid network. The paper also evaluates many different components, which 
the authors categorize into a “bag of freebies” that can be used during training and a “bag of specials” 
that can be used at detection time with minimal additional cost. 


While most bounding box object detectors continue to evaluate their results on the COCO 
dataset (Lin, Maire et al. 2014),²¹ newer datasets such as Open Images (Kuznetsova, Rom et 
al. 2020) and LVIS: Large Vocabulary Instance Segmentation (Gupta, Dollár, and Girshick 
2019) are now also being used (see Table 6.2). Two recent workshops that highlight the latest 
results using these datasets are Zendel et al. (2020) and Kirillov, Lin et al. (2020), which also 
have challenges related to instance segmentation, panoptic segmentation, keypoint estimation, 
and dense pose estimation, topics we discuss later in this chapter. Open-source 
frameworks for training and fine-tuning object detectors include the TensorFlow Object Detection 
API²² and PyTorch’s Detectron2.²³ 
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Contents/Reference 


Name / URL                  Extents                   Contents                              Reference

Object recognition

Oxford buildings dataset    Pictures of buildings     5,062 images                          Philbin, Chum et al. (2007)
  https://www.robots.ox.ac.uk/~vgg/data/oxbuildings
INRIA Holidays              Holiday scenes            1,491 images                          Jégou, Douze, and Schmid (2008)
  https://lear.inrialpes.fr/people/jegou/data.php
PASCAL                      Segmentations, boxes      11k images (2.9k with segmentations)  Everingham, Eslami et al. (2015)
  http://host.robots.ox.ac.uk/pascal/VOC
ImageNet                    Complete images           21k (WordNet) classes, 14M images     Deng, Dong et al. (2009)
  https://www.image-net.org
Fashion MNIST               Complete images           70k fashion products                  Xiao, Rasul, and Vollgraf (2017)
  https://github.com/zalandoresearch/fashion-mnist

Object detection and segmentation

Caltech Pedestrian Dataset  Bounding boxes            Pedestrians                           Dollár, Wojek et al. (2009)
  http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
MSR Cambridge               Per-pixel segmentations   23 classes                            Shotton, Winn et al. (2009)
  https://www.microsoft.com/en-us/research/project/image-understanding
LabelMe dataset             Polygonal boundaries      >500 categories                       Russell, Torralba et al. (2008)
  http://labelme.csail.mit.edu
Microsoft COCO              Segmentations, boxes      330k images                           Lin, Maire et al. (2014)
  https://cocodataset.org
Cityscapes                  Polygonal boundaries      30 classes, 25,000 images             Cordts, Omran et al. (2016)
  https://www.cityscapes-dataset.com
Broden                      Segmentation masks        A variety of visual concepts          Bau, Zhou et al. (2017)
  http://netdissect.csail.mit.edu
Broden+                     Segmentation masks        A variety of visual concepts          Xiao, Liu et al. (2018)
  https://github.com/CSAILVision/unifiedparsing
LVIS                        Instance segmentations    1,000 categories, 2.2M images         Gupta, Dollár, and Girshick (2019)
  https://www.lvisdataset.org
Open Images                 Segs., relationships      478k images, 3M relationships         Kuznetsova, Rom et al. (2020)
  https://g.co/dataset/openimages

Table 6.2 Image databases for classification, detection, and localization. 
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(a) (b) 


Figure 6.32 Examples of image segmentation (Kirillov, He et al. 2019) © 2019 IEEE: (a) 
original image; (b) semantic segmentation (per-pixel classification); (c) instance segmentation 
(delineate each object); (d) panoptic segmentation (label all things and stuff). 


6.4 Semantic segmentation 


A challenging version of general object recognition and scene understanding is to simul- 
taneously perform recognition and accurate boundary segmentation (Fergus 2007). In this 
section, we examine a number of related problems, namely semantic segmentation (per-pixel 
class labeling), instance segmentation (accurately delineating each separate object), panoptic 
segmentation (labeling both objects and stuff), and dense pose estimation (labeling pixels be- 
longing to people and their body parts). Figures 6.32 and 6.43 show some of these kinds of 
segmentations. 

The basic approach to simultaneous recognition and segmentation is to formulate the 
problem as one of labeling every pixel in an image with its class membership. Older ap- 
proaches often did this using energy minimization or Bayesian inference techniques, i.e., 
conditional random fields (Section 4.3.1). The TextonBoost system of Shotton, Winn et al. 
(2009) uses unary (pixel-wise) potentials based on image-specific color distributions (Sec- 
tion 4.3.2), location information (e.g., foreground objects are more likely to be in the middle 
of the image, sky is likely to be higher, and road is likely to be lower), and novel texture- 
layout classifiers trained using shared boosting. It also uses traditional pairwise potentials 
that look at image color gradients. The texton-layout features first filter the image with a 
series of 17 oriented filter banks and then cluster the responses to classify each pixel into 30 
different texton classes (Malik, Belongie et al. 2001). The responses are then filtered using 
offset rectangular regions trained with joint boosting (Viola and Jones 2004) to produce the 
texton-layout features used as unary potentials. Figure 6.33 shows some examples of images 
successfully labeled and segmented using TextonBoost. 


The TextonBoost conditional random field framework has been extended to LayoutCRFs 


21 See https://codalab.org for the latest competitions and leaderboards. 
22 https://github.com/tensorflow/models/tree/master/research/object_detection 
23 https://github.com/facebookresearch/detectron2 
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Figure 6.33 Simultaneous recognition and segmentation using TextonBoost (Shotton, Winn 
et al. 2009) © 2009 Springer. 


by Winn and Shotton (2006), who incorporate additional constraints to recognize multiple 
object instances and deal with occlusions, and by Hoiem, Rother, and Winn (2007) to incor- 
porate full 3D models. Conditional random fields continued to be widely used and extended 
for simultaneous recognition and segmentation applications, as described in the first edition 
of this book (Szeliski 2010, Section 14.4.3), along with approaches that first performed low- 
level or hierarchical segmentations (Section 7.5). 


The development of fully convolutional networks (Long, Shelhamer, and Darrell 2015), 
which we described in Section 5.4.1, enabled per-pixel semantic labeling using a single neu- 
ral network. While the first networks suffered from poor resolution (very loose boundaries), 
the addition of conditional random fields at a final stage (Chen, Papandreou et al. 2018; 
Zheng, Jayasumana et al. 2015), deconvolutional upsampling (Noh, Hong, and Han 2015), 
and fine-level connections in U-nets (Ronneberger, Fischer, and Brox 2015), all helped im- 
prove accuracy and resolution. 


Modern semantic segmentation systems are often built on architectures such as the fea- 
ture pyramid network (Lin, Dollár et al. 2017), which have top-down connections to help 
percolate semantic information down to higher-resolution maps. For example, the Pyramid 
Scene Parsing Network (PSPNet) of Zhao, Shi et al. (2017) uses spatial pyramid pooling (He, 
Zhang et al. 2015) to aggregate features at various resolution levels. The Unified Perceptual 
Parsing network (UPerNet) of Xiao, Liu et al. (2018) uses both a feature pyramid network 
and a pyramid pooling module to label image pixels not only with object categories but also 
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Figure 6.34 The UPerNet framework for Unified Perceptual Parsing (Xiao, Liu et al. 2018) 
© 2018 Springer. A Feature Pyramid Network (FPN) backbone is appended with a Pyramid 
Pooling Module (PPM) before feeding it into the top-down branch of the FPN. 
Various layers of the FPN and/or PPM are fed into 
different heads, including a scene head for image classification, object and part heads from 
the fused FPN features, a material head operating on the finest level of the FPN, and a texture 
head that does not participate in the FPN fine tuning. The bottom gray squares give more 
details into some of the heads. 


materials, parts, and textures, as shown in Figure 6.34. HRNet (Wang, Sun et al. 2020) keeps 
high-resolution versions of feature maps throughout the pipeline with occasional interchange 
of information between channels at different resolution layers. Such networks can also be 
used to estimate surface normals and depths in an image (Huang, Zhou et al. 2019; Wang, 
Geraghty et al. 2020). 

Semantic segmentation algorithms were initially trained and tested on datasets such as 
MSRC (Shotton, Winn et al. 2009) and PASCAL VOC (Everingham, Eslami et al. 2015). 
More recent datasets include the Cityscapes dataset for urban scene understanding (Cordts, 
Omran et al. 2016) and ADE20K (Zhou, Zhao et al. 2019), which labels pixels in a wider 
variety of indoor and outdoor scenes with 150 different category and part labels. The Broadly 
and Densely Labeled Dataset (Broden) created by Bau, Zhou et al. (2017) federates a number 
of such densely labeled datasets, including ADE20K, Pascal-Context, Pascal-Part, OpenSurfaces, 
and Describable Textures, to obtain a wide range of labels such as materials and textures 
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FLAIR 1.16 Prediction True Label 


Figure 6.35 3D volumetric medical image segmentation using a deep network (Kamnitsas, 
Ferrante et al. 2016) © 2016 Springer. 


in addition to basic object semantics. While this dataset was originally developed to aid in 
the interpretability of deep networks, it has also proven useful (with extensions) for training 
unified multi-task labeling systems such as UPerNet (Xiao, Liu et al. 2018). Table 6.2 lists 
some of the datasets used for training and testing semantic segmentation algorithms. 

One final note. While semantic image segmentation and labeling have widespread ap- 
plications in image understanding, the converse problem of going from a semantic sketch or 
painting of a scene to a photorealistic image has also received widespread attention (Johnson, 
Gupta, and Fei-Fei 2018; Park, Liu et al. 2019; Bau, Strobelt et al. 2019; Ntavelis, Romero et 
al. 2020b). We look at this topic in more detail in Section 10.5.3 on semantic image synthesis. 


6.4.1 Application: Medical image segmentation 


One of the most promising applications of image segmentation is in the medical imaging 
domain, where it can be used to segment anatomical tissues for later quantitative analysis. 
Figure 4.21 shows a binary graph cut with directed edges being used to segment the liver tis- 
sue (light gray) from its surrounding bone (white) and muscle (dark gray) tissue. Figure 6.35 
shows the segmentation of a brain scan for the detection of brain tumors. Before the de- 
velopment of the mature optimization and deep learning techniques used in modern image 
segmentation algorithms, such processing required much more laborious manual tracing of 
individual X-ray slices. 

Initially, optimization techniques such as Markov random fields (Section 4.3.2) and dis- 
criminative classifiers such as random forests (Section 5.1.5) were used for medical image 
segmentation (Criminisi, Robertson et al. 2013). More recently, the field has shifted to deep 
learning approaches (Kamnitsas, Ferrante et al. 2016; Kamnitsas, Ledig et al. 2017; Havaei, 
Davy et al. 2017). 

The fields of medical image segmentation (McInerney and Terzopoulos 1996) and medical 
image registration (Kybic and Unser 2003) (Section 9.2.3) are rich research fields with 
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Figure 6.36 Instance segmentation using Mask R-CNN (He, Gkioxari et al. 2017) © 2017 


IEEE: (a) system architecture, with an additional segmentation branch; (b) sample results. 


their own specialized conferences, such as Medical Image Computing and Computer Assisted 
Intervention (MICCAI), and journals, such as Medical Image Analysis and IEEE Transactions 
on Medical Imaging. These can be great sources of references and ideas for research 
in this area. 


6.4.2 Instance segmentation 


Instance segmentation is the task of finding all of the relevant objects in an image and pro- 
ducing pixel-accurate masks for their visible regions (Figure 6.36b). One potential approach 
to this task is to perform known object instance recognition (Section 6.1) and to then 
back-project the object model into the scene (Lowe 2004), as shown in Figure 6.1d, or to match 
portions of the new scene to pre-learned (segmented) object models (Ferrari, Tuytelaars, and 
Van Gool 2006b; Kannala, Rahtu et al. 2008). However, this approach only works for known 
rigid 3D models. 

For more complex (flexible) object models, such as those for humans, a different approach 
is to pre-segment the image into larger or smaller pieces (Section 7.5) and to then match such 
pieces to portions of the model (Mori, Ren et al. 2004; Mori 2005; He, Zemel, and Ray 2006; 
Gu, Lim et al. 2009). For general highly variable classes, a related approach is to vote for 
potential object locations and scales based on feature correspondences and to then infer the 
object extents (Leibe, Leonardis, and Schiele 2008). 

With the advent of deep learning, researchers started combining region proposals or image 
pre-segmentations with convolutional second stages to infer the final instance segmentations 
(Hariharan, Arbeláez et al. 2014; Hariharan, Arbeláez et al. 2015; Dai, He, and Sun 2015; 
Pinheiro, Lin et al. 2016; Dai, He, and Sun 2016; Li, Qi et al. 2017). 

A breakthrough in instance segmentation came with the introduction of Mask R-CNN 
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Figure 6.37 Person keypoint detection and segmentation using Mask R-CNN (He, Gkioxari 
et al. 2017) © 2017 IEEE 


(He, Gkioxari et al. 2017). As shown in Figure 6.36a, Mask R-CNN uses the same region 
proposal network as Faster R-CNN (Ren, He et al. 2015), but then adds an additional branch 
for predicting the object mask, in addition to the existing branch for bounding box refinement 
and classification.²⁴ As with other networks that have multiple branches (or heads) and 
outputs, the training losses corresponding to each supervised output need to be carefully bal- 
anced. It is also possible to add additional branches, e.g., branches trained to detect human 
keypoint locations (implemented as per-keypoint mask images), as shown in Figure 6.37. 

Since its introduction, the performance of Mask R-CNN and its extensions has continued 
to improve with advances in backbone architectures (Liu, Qi et al. 2018; Chen, Pang et al. 
2019). Two recent workshops that highlight the latest results in this area are the COCO + 
LVIS Joint Recognition Challenge (Kirillov, Lin et al. 2020) and the Robust Vision Challenge 
(Zendel et al. 2020).²⁵ It is also possible to replace the pixel masks produced by most instance 
segmentation techniques with time-evolving closed contours, i.e., “snakes” (Section 7.3.1), 
as in Peng, Jiang et al. (2020). In order to encourage higher-quality segmentation boundaries, 
Cheng, Girshick et al. (2021) propose a new Boundary Intersection-over-Union (Boundary 
IoU) metric to replace the commonly used Mask IoU metric. 


6.4.3 Panoptic segmentation 


As we have seen, semantic segmentation classifies each pixel in an image into its semantic 
category, i.e., what stuff does each pixel correspond to. Instance segmentation associates 
pixels with individual objects, i.e., how many objects are there and what are their extents 


24 Mask R-CNN was the first paper to introduce the terms backbone and head to describe the common deep 
convolutional feature extraction front end and the specialized back end branches. 
25 You can find the leaderboards for instance segmentation and other COCO recognition tasks at 
https://cocodataset.org. 
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Figure 6.38 Panoptic segmentation results produced using a Panoptic Feature Pyramid 
Network (Kirillov, Girshick et al. 2019) © 2019 IEEE. 


Figure 6.39 Detectron2 panoptic segmentation results on some of my personal photos. 
(Click on the “Colab Notebook” link at https://github.com/facebookresearch/detectron2 and 
then edit the input image URL to try your own.) 


(Figure 6.32). Putting both of these systems together has long been a goal of semantic scene 
understanding (Yao, Fidler, and Urtasun 2012; Tighe and Lazebnik 2013; Tu, Chen et al. 
2005). Doing this on a per-pixel level results in a panoptic segmentation of the scene, where 
all of the objects are correctly segmented and the remaining stuff is correctly labeled (Kir- 
illov, He et al. 2019). Producing a sensible panoptic quality (PQ) metric that simultaneously 
balances the accuracy on both tasks takes some careful design. In their paper, Kirillov, He 
et al. (2019) describe their proposed metric and analyze the performance of both humans (in 
terms of consistency) and recent algorithms on three different datasets. 
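
For readers who want a feel for how such a metric behaves, here is a minimal single-class sketch of panoptic quality. It assumes the definition from Kirillov, He et al. (2019) as commonly stated: a predicted and a ground truth segment match if their IoU exceeds 0.5, and PQ is the sum of matched IoUs divided by |TP| + ½|FP| + ½|FN|. Representing segments as Python sets of pixel indices is a simplification made for this example.

```python
def panoptic_quality(pred_segments, gt_segments):
    """PQ for one class; each segment is a set of pixel indices.

    Segments within each list are assumed not to overlap, so the IoU > 0.5
    rule can match each ground truth segment to at most one prediction.
    """
    matched_gt = set()
    iou_sum = 0.0
    for pred in pred_segments:
        for j, gt in enumerate(gt_segments):
            if j in matched_gt:
                continue
            union = len(pred | gt)
            overlap = len(pred & gt) / union if union else 0.0
            if overlap > 0.5:
                matched_gt.add(j)
                iou_sum += overlap
                break
    tp = len(matched_gt)
    fp = len(pred_segments) - tp   # predicted segments with no match
    fn = len(gt_segments) - tp     # ground truth segments that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


gt = [set(range(0, 10)), set(range(10, 20))]
pred = [set(range(0, 9)), set(range(10, 14))]  # one good segment, one poor one
print(panoptic_quality(pred, gt))
```

The full metric averages this quantity over all thing and stuff classes, so a system is rewarded both for recognizing segments and for delineating them accurately.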

The COCO dataset has now been extended to include a panoptic segmentation task, on 
which some recent results can be found in the ECCV 2020 workshop on this topic (Kirillov, 
Lin et al. 2020). Figure 6.38 shows some segmentations produced by the panoptic feature 
pyramid network described by Kirillov, Girshick et al. (2019), which adds two branches for 
instance segmentation and semantic segmentation to a feature pyramid network. 
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(a) (b) (c) (d) 


Figure 6.40 Scene completion using millions of photographs (Hays and Efros 2007) © 
2007 ACM: (a) original image; (b) after unwanted foreground removal; (c) plausible scene 
matches, with the one the user selected highlighted in red; (d) output image after replacement 
and blending. 


6.4.4 Application: Intelligent photo editing 


Advances in object recognition and scene understanding have greatly increased the power of 
intelligent (semi-automated) photo editing applications. One example is the Photo Clip Art 
system of Lalonde, Hoiem et al. (2007), which recognizes and segments objects of interest, 
such as pedestrians, in internet photo collections and then allows users to paste them into their 
own photos. Another is the scene completion system of Hays and Efros (2007), which tackles 
the same inpainting problem we will study in Section 10.5. Given an image in which we wish 
to erase and fill in a large section (Figure 6.40a–b), where do we get the pixels to fill in the 
gaps in the edited image? Traditional approaches either use smooth continuation (Bertalmio, 
Sapiro et al. 2000) or borrow pixels from other parts of the image (Efros and Leung 1999; 
Criminisi, Pérez, and Toyama 2004; Efros and Freeman 2001). With the availability of huge 
numbers of images on the web, it often makes more sense to find a different image to serve 
as the source of the missing pixels. 

In their system, Hays and Efros (2007) compute the gist of each image (Oliva and Torralba 
2001; Torralba, Murphy et al. 2003) to find images with similar colors and composition. They 
then run a graph cut algorithm that minimizes image gradient differences and composite the 
new replacement piece into the original image using Poisson image blending (Section 8.4.4) 
(Pérez, Gangnet, and Blake 2003). Figure 6.40d shows the resulting image with the erased 
foreground rooftops region replaced with sailboats. Additional examples of photo editing and 
computational photography applications enabled by what has been dubbed “internet computer 
vision” can be found in the special journal issue edited by Avidan, Baker, and Shan (2010). 

A different application of image recognition and segmentation is to infer 3D structure 
from a single photo by recognizing certain scene structures. For example, Criminisi, Reid, 
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b> 
Figure 6.41 Automatic photo pop-up (Hoiem, Efros, and Hebert 2005a) © 2005 ACM: 
(a) input image; (b) superpixels are grouped into (c) multiple regions; (d) labels indicating 
ground (green), vertical (red), and sky (blue); (e) novel view of resulting piecewise-planar 
3D model. 


and Zisserman (2000) detect vanishing points and have the user draw basic structures, such 
as walls, to infer the 3D geometry (Section 11.1.2). Hoiem, Efros, and Hebert (2005a), on 
the other hand, work with more “organic” scenes such as the one shown in Figure 6.41. Their 
system uses a variety of classifiers and statistics learned from labeled images to classify each 
pixel as either ground, vertical, or sky (Figure 6.41d). To do this, they begin by computing 
superpixels (Figure 6.41b) and then group them into plausible regions that are likely to share 
similar geometric labels (Figure 6.41c). After all the pixels have been labeled, the boundaries 
between the vertical and ground pixels can be used to infer 3D lines along which the image 
can be folded into a “pop-up” (after removing the sky pixels), as shown in Figure 6.41e. In
related work, Saxena, Sun, and Ng (2009) develop a system that directly infers the depth and
orientation of each pixel instead of using just three geometric class labels. We will examine
techniques to infer depth from single images in more detail in Section 12.8.
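
To make the folding step more concrete, here is a minimal sketch (not the authors' implementation) that scans a label map like the one in Figure 6.41d and reports, for each image column, the row where vertical pixels meet the ground. The label encoding and the name `fold_rows` are illustrative assumptions. The actual system fits line segments to these boundaries rather than treating each column independently.

```python
# Hypothetical per-pixel geometric labels, in the spirit of Figure 6.41d.
GROUND, VERTICAL, SKY = "G", "V", "S"


def fold_rows(labels):
    """For each column, return the row of the first ground pixel that lies
    directly below a vertical pixel, or None if the column has no such pixel.

    `labels` is a list of rows (top to bottom); each row is a string or list
    of GROUND / VERTICAL / SKY labels.
    """
    height, width = len(labels), len(labels[0])
    folds = []
    for x in range(width):
        fold = None
        for y in range(1, height):
            if labels[y - 1][x] == VERTICAL and labels[y][x] == GROUND:
                fold = y
                break
        folds.append(fold)
    return folds


# Toy 5 x 4 label map: sky on top, a vertical structure, ground at the bottom.
labels = [
    "SSSS",
    "SVVS",
    "VVVV",
    "GVVG",
    "GGGG",
]
print(fold_rows(labels))  # [3, 4, 4, 3]
```

The per-column fold rows trace the polyline along which the image would be creased. Sky pixels are simply ignored here, which mirrors their removal before folding.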


6.4.5 Pose estimation 


The inference of human pose (head, body, and limb locations and attitude) from a single
image can be viewed as yet another kind of segmentation task. We have already discussed
some pose estimation techniques in Section 6.3.2 on pedestrian detection, as shown
in Figure 6.25. Starting with the seminal work by Felzenszwalb and Huttenlocher (2005), 
2D and 3D pose detection and estimation rapidly developed as an active research area, with 
important advances and datasets (Sigal and Black 2006a; Rogez, Rihan et al. 2008; Andriluka, 
Roth, and Schiele 2009; Bourdev and Malik 2009; Johnson and Everingham 2011; Yang 
and Ramanan 2011; Pishchulin, Andriluka et al. 2013; Sapp and Taskar 2013; Andriluka, 
Pishchulin et al. 2014). 

Figure 6.42 OpenPose real-time multi-person 2D pose estimation (Cao, Simon et al. 2017)
© 2017 IEEE.

More recently, deep networks have become the preferred technique to identify human
body keypoints in order to convert these into pose estimates (Tompson, Jain et al. 2014;
Toshev and Szegedy 2014; Pishchulin, Insafutdinov et al. 2016; Wei, Ramakrishna et al.
2016; Cao, Simon et al. 2017; He, Gkioxari et al. 2017; Hidalgo, Raaj et al. 2019; Huang,
Zhu et al. 2020).26 Figure 6.42 shows some of the impressive real-time multi-person 2D pose
estimation results produced by the OpenPose system (Cao, Hidalgo et al. 2019).
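
Many keypoint networks output one confidence map (heatmap) per body part, so the last step from network output to keypoints can be sketched as a peak search. The code below is a minimal single-person illustration under that assumption. The function name, the threshold, and the toy heatmaps are hypothetical, and multi-person systems such as OpenPose additionally need to group peaks into individuals, which is not shown.

```python
def keypoints_from_heatmaps(heatmaps, threshold=0.1):
    """Return {part: (x, y, score)} for the peak of each heatmap.

    Parts whose peak score falls below `threshold` map to None, which is one
    simple way of treating them as occluded or not visible.
    """
    keypoints = {}
    for name, heatmap in heatmaps.items():
        best = (0, 0, float("-inf"))
        for y, row in enumerate(heatmap):
            for x, score in enumerate(row):
                if score > best[2]:
                    best = (x, y, score)
        keypoints[name] = best if best[2] >= threshold else None
    return keypoints


# Toy 3 x 3 heatmaps standing in for real network outputs.
heatmaps = {
    "nose": [[0.0, 0.1, 0.0],
             [0.1, 0.9, 0.2],
             [0.0, 0.1, 0.0]],
    "left_wrist": [[0.02, 0.01, 0.0],
                   [0.0, 0.03, 0.0],
                   [0.0, 0.0, 0.05]],
}
print(keypoints_from_heatmaps(heatmaps))
```

In this toy input the nose is found at pixel (1, 1), while the wrist heatmap never exceeds the threshold and is reported as missing.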

The latest and most challenging task in human pose estimation is the DensePose task
introduced by Güler, Neverova, and Kokkinos (2018), where the goal is to associate each pixel in
RGB images of people with 3D points on a surface-based model, as shown in Figure 6.43.
The authors provide dense annotations for 50,000 people appearing in COCO images and 
evaluate a number of correspondence networks, including their own DensePose-RCNN with 
several extensions. A more in-depth discussion on 3D human body modeling and tracking
can be found in Section 13.6.4.

26 You can find the leaderboards for human keypoint detection at https://cocodataset.org.

Figure 6.43 Dense pose estimation aims at mapping all human pixels of an RGB image
to the 3D surface of the human body (Güler, Neverova, and Kokkinos 2018) © 2018 IEEE.
The paper describes DensePose-COCO, a large-scale ground-truth dataset containing
manually annotated image-to-surface correspondences for 50K persons and a DensePose-RCNN
trained to densely regress UV coordinates at multiple frames per second.


6.5 Video understanding


As we've seen in the previous sections of this chapter, image understanding mostly concerns
itself with naming and delineating the objects and stuff in an image, although the
relationships between objects and people are also sometimes inferred (Yao and Fei-Fei 2012; Gupta
and Malik 2015; Yatskar, Zettlemoyer, and Farhadi 2016; Gkioxari, Girshick et al. 2018).
(We will look at the topic of describing complete images in the next section on vision and
language.)

What, then, is video understanding? For many researchers, it starts with the detection and 
description of human actions, which are taken as the basic atomic units of videos. Of course, 
just as with images, these basic primitives can be chained into more complete descriptions of 
longer video sequences. 

Human activity recognition has been studied since the 1990s, along with related topics
such as human motion tracking, which we discuss in Sections 9.4.4 and 13.6.4. Aggarwal
and Cai (1999) provide a comprehensive review of these two areas, which they call human 
motion analysis. Some of the techniques they survey use point and mesh tracking, as well as 
spatio-temporal signatures. 

In the 2000s, attention shifted to spatio-temporal features, such as the clever use of op- 
tical flow in small patches to recognize sports activities (Efros, Berg et al. 2003) or spatio- 
temporal feature detectors for classifying actions in movies (Laptev, Marszalek et al. 2008), 
later combined with image context (Marszalek, Laptev, and Schmid 2009) and tracked fea- 
ture trajectories (Wang and Schmid 2013). Poppe (2010), Aggarwal and Ryoo (2011), and 
Weinland, Ronfard, and Boyer (2011) provide surveys of algorithms from this decade. Some 
of the datasets used in this research include the KTH human motion dataset (Schüldt, Laptev, 
and Caputo 2004), the UCF sports action dataset (Rodriguez, Ahmed, and Shah 2008), the 
Hollywood human action dataset (Marszalek, Laptev, and Schmid 2009), UCF-101 (Soomro, 
Zamir, and Shah 2012), and the HMDB human motion database (Kuehne, Jhuang et al. 2011). 


Figure 6.44 Video understanding using neural networks: (a) two-stream architecture for
video classification © Simonyan and Zisserman (2014a); (b) some alternative video
processing architectures (Carreira and Zisserman 2017) © 2017 IEEE; (c) a SlowFast network
with a low frame rate, low temporal resolution Slow pathway and a high frame rate, higher
temporal resolution Fast pathway (Feichtenhofer, Fan et al. 2019) © 2019 IEEE.

In the last decade, video understanding techniques have shifted to using deep networks
(Ji, Xu et al. 2013; Karpathy, Toderici et al. 2014; Simonyan and Zisserman 2014a; Tran,
Bourdev et al. 2015; Feichtenhofer, Pinz, and Zisserman 2016; Carreira and Zisserman 2017;
Varol, Laptev, and Schmid 2017; Wang, Xiong et al. 2019; Zhu, Li et al. 2020), sometimes
combined with temporal models such as LSTMs (Baccouche, Mamalet et al. 2011; Donahue,
Hendricks et al. 2015; Ng, Hausknecht et al. 2015; Srivastava, Mansimov, and Salakhutdinov
2015).

While it is possible to apply these networks directly to the pixels in the video stream, e.g., 
using 3D convolutions (Section 5.5.1), researchers have also investigated using optical flow 
(Section 9.3) as an additional input. The resulting two-stream architecture was proposed by
Simonyan and Zisserman (2014a) and is shown in Figure 6.44a. A later paper by Carreira 
and Zisserman (2017) compares this architecture to alternatives such as 3D convolutions on 
the pixel stream as well as hybrids of two streams and 3D convolutions (Figure 6.44b). 
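
One simple way to combine the two streams is late fusion, where each stream produces its own class scores and the resulting probabilities are averaged. The sketch below illustrates this under stated assumptions: the two lists of logits stand in for the outputs of hypothetical spatial (RGB) and temporal (optical flow) networks, and the equal weighting is an illustrative default rather than a value taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Late fusion: weighted average of the per-stream class probabilities."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1.0 - w) * a + w * b for a, b in zip(p_spatial, p_temporal)]


# Toy scores for three action classes.
spatial = [2.0, 1.0, 0.1]    # the appearance stream prefers class 0
temporal = [0.5, 2.5, 0.3]   # the motion stream prefers class 1
fused = fuse_two_streams(spatial, temporal)
print(fused.index(max(fused)))  # 1
```

Here the motion stream is more confident than the appearance stream, so the fused prediction follows it. Setting `temporal_weight=0.0` recovers the appearance-only prediction.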

The latest architectures for video understanding have gone back to using 3D convolutions 
on the raw pixel stream (Tran, Wang et al. 2018, 2019; Kumawat, Verma et al. 2021). Wu, 
Feichtenhofer et al. (2019) store 3D CNN features into what they call a long-term feature 
bank to give a broader temporal context for action recognition. Feichtenhofer, Fan et al. 
(2019) propose a two-stream SlowFast architecture, where a slow pathway operates at a lower 
frame rate and is combined with features from a fast pathway with higher temporal sampling 
but fewer channels (Figure 6.44c). Some widely used datasets for evaluating these
algorithms are summarized in Table 6.3. They include Charades (Sigurdsson, Varol et al.
2016), YouTube8M (Abu-El-Haija, Kothari et al. 2016), Kinetics (Carreira and Zisserman
2017), “Something-something” (Goyal, Kahou et al. 2017), AVA (Gu, Sun et al. 2018),
EPIC-KITCHENS (Damen, Doughty et al. 2018), and AVA-Kinetics (Li, Thotakuri et al. 2020). A
nice exposition of these and other video understanding algorithms can be found in Johnson 
(2020, Lecture 18). 
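
The different temporal sampling rates of the two SlowFast pathways can be illustrated with a few lines of index arithmetic. This is only a sketch of the sampling idea, not of the network itself. The parameter names `tau`, `alpha`, and `beta` and their default values (16, 8, and 1/8) are what we recall from the paper's default configuration and should be treated as assumptions.

```python
def slowfast_indices(num_frames, tau=16, alpha=8):
    """Frame indices seen by the Slow pathway (temporal stride tau) and by the
    Fast pathway (temporal stride tau / alpha, i.e., alpha times denser)."""
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha in this sketch")
    slow = list(range(0, num_frames, tau))
    fast = list(range(0, num_frames, tau // alpha))
    return slow, fast


def fast_pathway_channels(slow_channels, beta=1 / 8):
    """The Fast pathway is kept lightweight by using a fraction beta of the
    Slow pathway's channels."""
    return max(1, int(slow_channels * beta))


slow, fast = slowfast_indices(num_frames=64)
print(len(slow), len(fast))       # 4 32
print(fast_pathway_channels(64))  # 8
```

For a 64-frame clip, the Slow pathway sees 4 frames and the Fast pathway sees 32, while the Fast pathway carries only one eighth of the channels.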

As with image recognition, researchers have also started using self-supervised algorithms 
to train video understanding systems. Unlike images, video clips are usually multi-modal, 
i.e., they contain audio tracks in addition to the pixels, which can be an excellent source of 
unlabeled supervisory signals (Alwassel, Mahajan et al. 2020; Patrick, Asano et al. 2020).
When available at inference time, audio signals can improve the accuracy of such systems 
(Xiao, Lee et al. 2020). 

Finally, while action recognition is the main focus of most recent video understanding
work, it is also possible to classify videos into different scene categories such as “beach”,
“fireworks”, or “snowing.” This problem is called dynamic scene recognition and can be
addressed using spatio-temporal CNNs (Feichtenhofer, Pinz, and Wildes 2017).

Name/URL                                            Metadata                        Contents/Reference

Charades                                            Actions, objects, descriptions  9.8k videos
https://prior.allenai.org/projects/charades                                         Sigurdsson, Varol et al. (2016)

YouTube8M                                           Entities                        4.8k visual entities, 83M videos
https://research.google.com/youtube8m                                               Abu-El-Haija, Kothari et al. (2016)

Kinetics                                            Action classes                  700 action classes, 650k videos
https://deepmind.com/research/open-source/kinetics                                  Carreira and Zisserman (2017)

“Something-something”                               Actions with objects            174 actions, 220k videos
https://20bn.com/datasets/something-something                                       Goyal, Kahou et al. (2017)

AVA                                                 Actions                         80 actions in 430 15-minute videos
https://research.google.com/ava                                                     Gu, Sun et al. (2018)

EPIC-KITCHENS                                       Actions and objects             100 hours of egocentric videos
https://epic-kitchens.github.io                                                     Damen, Doughty et al. (2018)


Table 6.3 Datasets for video understanding and action recognition.


6.6 Vision and language 


The ultimate goal of much of computer vision research is not just to solve simpler tasks 
such as building 3D models of the world or finding relevant images, but to become an es- 
sential component of artificial general intelligence (AGI). This requires vision to integrate 
with other components of artificial intelligence such as speech and language understanding 
and synthesis, logical inference, and commonsense and specialized knowledge representation 
and reasoning. 

Advances in speech and language processing have enabled the widespread deployment of 
speech-based intelligent virtual assistants such as Siri, Google Assistant, and Alexa. Earlier in 
this chapter, we’ve seen how computer vision systems can name individual objects in images 
and find similar images by appearance or keywords. The next natural step of integration 
with other AI components is to merge vision and language, i.e., natural language processing 
(NLP). 

While this area has been studied for a long time (Duygulu, Barnard et al. 2002; Farhadi, 
Hejrati et al. 2010), the last decade has seen a rapid increase in performance and capabilities 
(Mogadala, Kalimuthu, and Klakow 2021; Gan, Yu et al. 2020). An example of this is the 
BabyTalk system developed by Kulkarni, Premraj et al. (2013), which first detects objects, 
their attributes, and their positional relationships, then infers a likely compatible labeling of
these objects, and finally generates an image caption, as shown in Figure 6.45a.

Figure 6.45 Image captioning systems: (a) BabyTalk detects objects, attributes, and
positional relationships and composes these into image captions (Kulkarni, Premraj et al. 2013)
© 2013 IEEE; (b-c) DenseCap associates word phrases with regions and then uses an RNN
to construct plausible sentences (Johnson, Karpathy, and Fei-Fei 2016) © 2016 IEEE.

Figure 6.46 Image captioning with attention: (a) The “Show, Attend, and Tell” system,
which uses hard attention to align generated words with image regions © Xu, Ba et al. (2015);
(b) Neural Baby Talk captions generated using different detectors, showing the association
between words and grounding regions (Lu, Yang et al. 2018) © 2018 IEEE.


Visual captioning 


The next few years brought a veritable explosion of papers on the topic of image caption- 
ing and description, including (Chen and Lawrence Zitnick 2015; Donahue, Hendricks et 
al. 2015; Fang, Gupta et al. 2015; Karpathy and Fei-Fei 2015; Vinyals, Toshev et al. 2015;
Xu, Ba et al. 2015; Johnson, Karpathy, and Fei-Fei 2016; Yang, He et al. 2016; You, Jin 
et al. 2016). Many of these systems combine CNN-based image understanding components 
(mostly object and human action detectors) with RNNs or LSTMs to generate the description, 
often in conjunction with other techniques such as multiple instance learning, maximum
entropy language models, and visual attention. One somewhat surprising early result was that
nearest-neighbor techniques, i.e., finding sets of similar-looking images with captions and
then creating a consensus caption, work quite well (Devlin, Gupta et al. 2015).
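
The nearest-neighbor idea can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not the method of Devlin, Gupta et al. (2015): images are represented by hypothetical feature vectors, neighbors are found by squared Euclidean distance, and the consensus caption is the neighbor caption with the highest average word overlap (Jaccard similarity) with the other neighbor captions, where a real system would more likely use a caption metric such as BLEU or CIDEr.

```python
def word_overlap(a, b):
    """Jaccard similarity between the word sets of two captions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def consensus_caption(query_feature, database, k=3):
    """Pick a caption for the query image from its k nearest neighbors.

    `database` is a list of (feature_vector, caption) pairs.
    """
    neighbors = sorted(
        database, key=lambda item: squared_distance(query_feature, item[0])
    )[:k]
    captions = [caption for _, caption in neighbors]

    def mean_similarity(i):
        others = [word_overlap(captions[i], captions[j])
                  for j in range(len(captions)) if j != i]
        return sum(others) / len(others) if others else 0.0

    return captions[max(range(len(captions)), key=mean_similarity)]


# Toy database with 2D features standing in for real image descriptors.
database = [
    ([0.0, 0.1], "a dog runs on the beach"),
    ([0.1, 0.0], "a dog runs along the beach"),
    ([0.2, 0.1], "a brown dog on the sand"),
    ([5.0, 5.0], "a plate of food on a table"),
]
print(consensus_caption([0.05, 0.05], database))  # a dog runs on the beach
```

The far-away food image is never retrieved, and among the three dog captions the one that shares the most words with the others is returned.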


Over the last few years, attention-based systems have continued to be essential
components of image captioning systems (Lu, Xiong et al. 2017; Anderson, He et al. 2018; Lu, Yang
et al. 2018). Figure 6.46 shows examples from two such papers, where each word in the
generated caption is grounded with a corresponding image region. The CVPR 2020 tutorial by
Zhou (2020) summarizes over two dozen related papers from the last five years, including
papers that use transformers (Section 5.5.3) to do the captioning. It also covers video
description and dense video captioning (Aafaq, Mian et al. 2019; Zhou, Kalantidis et al. 2019) and
vision-language pre-training (Sun, Myers et al. 2019; Zhou, Palangi et al. 2020; Li, Yin et al.
2020). The tutorial also has lectures on visual question answering and reasoning (Gan 2020),
text-to-image synthesis (Cheng 2020), and vision-language pre-training (Yu, Chen, and Li
2020).

Figure 6.47 An adversarial typographic attack used against CLIP (Radford, Kim et al.
2021) discovered by © Goh, Cammarata et al. (2021). Instead of predicting the object that
exists in the scene, CLIP predicts the output based on the adversarial handwritten label.
Panels, left to right: no label; labeled “iPod”; labeled “library”; labeled “pizza”.

For the task of image classification (Section 6.2), one of the major restrictions is that a
model can only predict a label from the discrete, pre-defined set of labels it was trained on. CLIP
(Radford, Kim et al. 2021) proposes an alternative approach that relies on image captions to
enable zero-shot transfer to any possible set of labels. Given an image and a set of candidate labels
(e.g., {dog, cat, ..., house}), CLIP predicts the label that maximizes the probability that the
image is captioned with a prompt similar to “A photo of a {label}”. Section 5.4.7 discusses
the training aspect of CLIP, which collects 400 million text-image pairs and uses contrastive
learning to determine how likely it is for an image to be paired with a caption.
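
The zero-shot prediction rule can be sketched as follows, assuming we already have an image embedding and a text encoder that map into a shared space. Everything named here is a stand-in: `encode_text` is a hypothetical callable (a dictionary lookup in the toy example), the 3D embeddings are made up, and the temperature value is an illustrative assumption rather than CLIP's learned parameter.

```python
import math


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def zero_shot_classify(image_embedding, labels, encode_text, temperature=0.07):
    """Score each label by the similarity between the image embedding and the
    embedding of the prompt built from that label, then apply a softmax."""
    prompts = [f"A photo of a {label}" for label in labels]
    logits = [cosine(image_embedding, encode_text(p)) / temperature
              for p in prompts]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy text "encoder": a lookup table from prompt to a made-up embedding.
toy_text_embeddings = {
    "A photo of a dog": [1.0, 0.0, 0.0],
    "A photo of a cat": [0.0, 1.0, 0.0],
    "A photo of a house": [0.0, 0.0, 1.0],
}
probs = zero_shot_classify(
    [0.9, 0.1, 0.0], ["dog", "cat", "house"], toy_text_embeddings.__getitem__
)
print(max(probs, key=probs.get))  # dog
```

Because the label set only enters through the prompts, swapping in a different list of labels requires no retraining, which is the sense in which the transfer is zero-shot.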

Remarkably, without having seen or fine-tuned to many popular image classification 
benchmarks (e.g., ImageNet, Caltech 101), CLIP can outperform independently fine-tuned 
ResNet-50 models supervised on each specific dataset. Moreover, compared to state-of- 
the-art classification models, CLIP’s zero-shot generalization is significantly more robust to 
dataset distribution shifts, performing well on each of ImageNet Sketch (Wang, Ge et al. 
2019), ImageNetV2 (Recht, Roelofs et al. 2019), and ImageNet-R (Hendrycks, Basart et 
al. 2020), without being specifically trained on any of them. In fact, Goh, Cammarata et 
al. (2021) found that CLIP units respond similarly to concepts presented in different
modalities (e.g., an image of Spiderman, text of the word spider, and a drawing of
Spiderman). Figure 6.47 shows the adversarial typographic attack they discovered that could fool
CLIP. When a handwritten class label (e.g., iPod) is simply placed on a real-world object (e.g.,
an apple), CLIP often predicts the class written on the label.

As with other areas of visual recognition and learning-based systems, datasets have played
an important role in the development of vision and language systems. Some widely used
datasets of images with captions include Conceptual Captions (Sharma, Ding et al. 2018),
the UIUC Pascal Sentence Dataset (Farhadi, Hejrati et al. 2010), the SBU Captioned Photo
Dataset (Ordonez, Kulkarni, and Berg 2011), Flickr30k (Young, Lai et al. 2014), COCO
Captions (Chen, Fang et al. 2015), and their extensions to 50 sentences per image (Vedantam,
Lawrence Zitnick, and Parikh 2015) (see Table 6.4). More densely annotated datasets such
as Visual Genome (Krishna, Zhu et al. 2017) describe different sub-regions of an image
with their own phrases, i.e., provide dense captioning, as shown in Figure 6.48. YFCC100M
(Thomee, Shamma et al. 2016) contains around 100M images from Flickr, but it only includes
the raw user uploaded metadata for each image, such as the title, time of upload, description,
tags, and (optionally) the location of the image.

Figure 6.48 Images and data from the Visual Genome dataset (Krishna, Zhu et al. 2017) ©
2017 Springer. (a) An example image with its region descriptors. (b) Each region has a graph
representation of objects, attributes, and pairwise relationships, which are combined into a
scene graph where all the objects are grounded to the image, and also associated questions
and answers. (c) Some sample question and answer pairs, which cover a spectrum of visual
tasks from recognition to high-level reasoning.

Name/URL                                            Metadata                        Contents/Reference

Flickr30k (Entities)                                Image captions (grounded)       30k images (+ bounding boxes)
https://shannon.cs.illinois.edu/DenotationGraph                                     Young, Lai et al. (2014)
http://bryanplummer.com/Flickr30kEntities                                           Plummer, Wang et al. (2017)

COCO Captions                                       Whole image captions            1.5M captions, 330k images
https://cocodataset.org/#captions-2015                                              Chen, Fang et al. (2015)

Conceptual Captions                                 Whole image captions            3.3M image caption pairs
https://ai.google.com/research/ConceptualCaptions                                   Sharma, Ding et al. (2018)

YFCC100M                                            Flickr metadata                 100M images with metadata
http://projects.dfki.uni-kl.de/yfcc100m                                             Thomee, Shamma et al. (2016)

Visual Genome                                       Dense annotations               108k images with region graphs
https://visualgenome.org                                                            Krishna, Zhu et al. (2017)

VQA v2.0                                            Question/answer pairs           265k images
https://visualqa.org                                                                Goyal, Khot et al. (2017)

VCR                                                 Multiple choice questions       110k movie clips, 290k QAs
https://visualcommonsense.com                                                       Zellers, Bisk et al. (2019)

GQA                                                 Compositional QA                22M questions on Visual Genome
https://visualreasoning.net                                                         Hudson and Manning (2019)

VisDial                                             Dialogs for chatbot             120k COCO images + dialogs
https://visualdialog.org                                                            Das, Kottur et al. (2017)


Table 6.4 Image datasets for vision and language research.


Figure 6.49 Qualitative text-to-image generation results from DALL-E, showing a wide
range of generalization abilities © Ramesh, Pavlov et al. (2021). The bottom right example
provides a partially complete image prompt of a cat, along with text, and has the model fill in
the rest of the image. The other three examples only start with the text prompt as input, with
the model generating the entire image.

Metrics for measuring sentence similarity also play an important role in the development
of image captioning and other vision and language systems. Some widely used metrics
include BLEU: BiLingual Evaluation Understudy (Papineni, Roukos et al. 2002), ROUGE:
Recall-Oriented Understudy for Gisting Evaluation (Lin 2004), METEOR: Metric for Evaluation
of Translation with Explicit ORdering (Banerjee and Lavie 2005), CIDEr: Consensus-based
Image Description Evaluation (Vedantam, Lawrence Zitnick, and Parikh 2015), and SPICE:
Semantic Propositional Image Caption Evaluation (Anderson, Fernando et al. 2016).27
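
To give a flavor of how such metrics work, the sketch below computes the clipped (modified) n-gram precision that forms the core of BLEU. It is deliberately incomplete: full BLEU combines these precisions for several n-gram lengths using a geometric mean and multiplies by a brevity penalty, neither of which is shown. Whitespace tokenization and the function names are simplifying assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also occur in some reference, where
    each n-gram is credited at most as often as it appears in the single
    reference that contains it most often."""
    cand_counts = ngrams(candidate.split(), n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the the the the the the the", references))  # 2/7
```

The clipping step is what prevents a degenerate caption that repeats a common word from scoring well: “the” occurs seven times in the candidate but at most twice in any one reference, so only two occurrences are credited.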


Text-to-image generation 


The task of text-to-image generation is the inverse of visual captioning, i.e., given a text 
prompt, generate the image. Since images are represented in such high dimensionality, gen- 
erating them to look coherent has historically been difficult. Generating images from a text 
prompt can be thought of as a generalization of generating images from a small set of class la- 
bels (Section 5.5.4). Since there is a near-infinite number of possible text prompts, successful 
models must be able to generalize from the relatively small fraction seen during training. 

Early work on this task from Mansimov, Parisotto et al. (2016) used an RNN to iteratively 
draw an image from scratch. Their results showed some resemblance to the text prompts, 
although the generated images were quite blurred. Soon afterwards, Reed, Akata et al.
(2016) applied a GAN to the problem and began to show promising results on unseen text
prompts. Their generated images were relatively small (64 × 64). Later papers improved on
this, often by first generating a small-scale image and then conditioning on that image
and the text input to generate a higher-resolution image (Zhang, Xu et al. 2017, 2018; Xu,
Zhang et al. 2018; Li, Qi et al. 2019).

DALL-E (Ramesh, Pavlov et al. 2021) uses orders of magnitude more data (250 million
text-image pairs from the internet) and compute to achieve astonishing qualitative results
(Figure 6.49).28 Their approach produces promising results for generalizing beyond training data,
even compositionally piecing together objects that are not often related (e.g., an armchair and
an avocado), producing many styles (e.g., painting, cartoon, charcoal drawings), and working
reasonably well with difficult objects (e.g., mirrors or text).


27 See https://www.cs.toronto.edu/~fidler/slides/2017/CSC2539/Kaustav-slides.pdf.

The model for DALL-E consists of two components: a VQ-VAE-2 (Section 5.5.4) and a 
decoder transformer (Section 5.5.3). The text is tokenized into 256 tokens, each of which is 
one of 16,384 possible vectors using a BPE encoding (Sennrich, Haddow, and Birch 2015).
The VQ-VAE-2 uses a codebook of size 8,192 (significantly larger than the codebook of size
512 used in the original VQ-VAE-2 paper) to compress images as a 32 × 32 grid of vector
tokens. At inference time, DALL-E uses a transformer decoder, which starts with the 256
text tokens to autoregressively predict the 32 × 32 grid of image tokens. Given such a grid,
the VQ-VAE-2 is able to use its decoder to generate the final RGB image of size 256 × 256.
To achieve better empirical results, DALL-E generates 512 image candidates and reranks
them using CLIP (Radford, Kim et al. 2021), which determines how likely a given caption is
to be associated with a given image.
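
The token bookkeeping and the reranking step described above can be summarized in a short sketch. The numbers come from the text, while the function names are our own and `score_fn` is a hypothetical stand-in for a caption-image compatibility score such as the one CLIP provides.

```python
def dalle_sequence_layout(text_tokens=256, grid=32, image_size=256):
    """Sizes implied by the description in the text."""
    image_tokens = grid * grid
    return {
        "image_tokens": image_tokens,                    # 32 * 32 = 1024
        "sequence_length": text_tokens + image_tokens,   # 256 + 1024 = 1280
        "pixels_per_token_side": image_size // grid,     # 256 / 32 = 8
    }


def rerank(candidates, caption, score_fn, keep=1):
    """Keep the `keep` candidates with the highest caption-image score."""
    ranked = sorted(candidates,
                    key=lambda image: score_fn(caption, image),
                    reverse=True)
    return ranked[:keep]


print(dalle_sequence_layout())

# Toy usage: "images" are strings and the score is simply their length.
best = rerank(["a", "bbb", "cc"], "any caption", lambda c, img: len(img), keep=2)
print(best)  # ['bbb', 'cc']
```

Each image token therefore summarizes an 8 × 8 pixel patch, and the transformer models sequences of 1,280 tokens. In the real system, `candidates` would hold the 512 sampled images.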

An intriguing extension of DALL-E is to use the VQ-VAE-2 encoder to supply a subset
of the compressed image tokens and have the transformer predict the rest. For instance,
suppose we are given a text input and an
image. The text input can be tokenized into its 256 tokens, and one can obtain the 32 ×
32 image tokens using the VQ-VAE-2 encoder. If we then discard the bottom half of the
image tokens, the transformer decoder can be used to autoregressively predict which tokens
might be there. These tokens, along with the non-discarded ones from the original image, can
be passed into the VQ-VAE-2 decoder to produce a completed image. Figure 6.49 (bottom
right) shows how such a text and partial image prompt can be used for applications such as
image-to-image translation (Section 5.5.4).
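
The completion procedure amounts to seeding the autoregressive loop with the text tokens plus the retained image tokens. The sketch below shows only this control flow. `predict_next` is a hypothetical callable standing in for the transformer decoder, and the image tokens are assumed to be stored in raster order so that the first half of the list corresponds to the top half of the image.

```python
def complete_image_tokens(text_tokens, image_tokens, predict_next,
                          grid=32, keep_rows=16):
    """Keep the top `keep_rows` rows of image tokens and predict the rest.

    Returns the full list of grid * grid image tokens: the kept tokens
    followed by the predicted ones.
    """
    kept = list(image_tokens[:keep_rows * grid])
    sequence = list(text_tokens) + kept
    target_length = len(text_tokens) + grid * grid
    while len(sequence) < target_length:
        # Each new token is conditioned on everything generated so far.
        sequence.append(predict_next(sequence))
    return sequence[len(text_tokens):]


# Toy usage with a dummy "model" that always predicts token 42.
text = list(range(256))
image = [7] * (32 * 32)
completed = complete_image_tokens(text, image, lambda seq: 42)
print(len(completed), completed[0], completed[-1])  # 1024 7 42
```

The returned grid would then be handed to the VQ-VAE-2 decoder. Setting `keep_rows=0` reduces this to ordinary text-only generation.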


Visual Question Answering and Reasoning 


Image and video captioning are useful tasks that bring us closer to building artificially in- 
telligent systems, as they demonstrate the ability to put together visual cues such as object 
identities, attributes, and actions. However, it remains unclear if the system has understood 
the scene at a deeper level and if it can reason about the constituent pieces and how they fit 
together. 

To address these concerns, researchers have been building visual question answering
(VQA) systems, which require the vision algorithm to answer open-ended questions about
the image, such as the ones shown in Figure 6.48c. A lot of this work started with the
creation of the Visual Question Answering (VQA) dataset (Antol, Agrawal et al. 2015), which
spurred a large amount of subsequent research. The following year, VQA v2.0 improved
this dataset by creating a balanced set of image pairs, where each question had different
answers in the two images (Goyal, Khot et al. 2017).29 This dataset was further extended to
reduce the influence of prior assumptions and data distributions and to encourage answers to
be grounded in the images (Agrawal, Batra et al. 2018).


28 Play with the results at https://openai.com/blog/dall-e.


Since then, many additional VQA datasets have been created. These include the VCR 
dataset for visual commonsense reasoning (Zellers, Bisk et al. 2019) and the GQA dataset and 
metrics for evaluating visual reasoning and compositional question answering (Hudson and 
Manning 2019), which is built on top of the information about objects, attributes, and relations 
provided through the Visual Genome scene graphs (Krishna, Zhu et al. 2017). A discussion 
of these and other datasets for VQA can be found in the CVPR 2020 tutorial by Gan (2020), 
including datasets that test visual grounding and referring expression comprehension, visual 
entailment, using external knowledge, reading text, answering sub-questions, and using logic.
Some of these datasets are summarized in Table 6.4.


As with image and video captioning, VQA systems use various flavors of attention to 
associate pixel regions with semantic concepts (Yang, He et al. 2016). However, instead of 
using sequence models such as RNNs, LSTMs, or transformers to generate text, the natural 
language question is first parsed to produce an encoding that is then fused with the image
embedding to generate the desired answer.
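
A minimal sketch of this attend-then-fuse pattern is given below. It makes several simplifying assumptions that real systems do not: the question encoding and the region features share the same dimensionality, attention scores are plain dot products, and fusion is an elementwise product, whereas published models use learned projections and richer fusion operators. The answer classifier that would consume the fused vector is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_and_fuse(question_vec, region_feats):
    """Question-guided soft attention over image regions, followed by an
    elementwise-product fusion with the question encoding."""
    scores = [sum(q * r for q, r in zip(question_vec, region))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(question_vec)
    attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                for d in range(dim)]
    fused = [q * a for q, a in zip(question_vec, attended)]
    return weights, fused


# Toy example: two image regions, 2D features.
weights, fused = attend_and_fuse([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
print(weights)  # the first region receives most of the attention
print(fused)
```

The attention weights indicate which regions the question “looks at”, which is also what makes such models easier to inspect than ones that pool image features uniformly.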


The image semantic features can either be computed on a coarse grid, or a “bottom-up” 
object detector can be combined with a “top-down” attention mechanism to provide feature 
weightings (Anderson, He et al. 2018). In recent years, the pendulum has swung back and 
forth between techniques that use bottom-up regions and gridded feature descriptors, with 
two of the recent best-performing algorithms going back to the simpler (and much faster) 
gridded approach (Jiang, Misra et al. 2020; Huang, Zeng et al. 2020). The CVPR 2020 
tutorial by Gan (2020) discusses these and dozens of other VQA systems as well as their 
subcomponents, such as multimodal fusion variants (bilinear pooling, alignment, relational 
reasoning), neural module networks, robust VQA, and multimodal pre-training. The survey
by Mogadala, Kalimuthu, and Klakow (2021) and the annual VQA Challenge workshop
(Shrivastava, Hudson et al. 2020) are also excellent sources of additional information. And if you
would like to test out the current state of VQA systems, you can upload your own image to
https://vqa.cloudcv.org and ask the system your own questions.


29 https://visualqa.org


Visual Dialog. An even more challenging version of VQA is visual dialog, where a chatbot
is given an image and asked to answer open-ended questions about the image while also
referring to previous elements of the conversation. The VisDial dataset was the earliest to be
widely used for this task (Das, Kottur et al. 2017).30 You can find pointers to systems that
have been developed for this task at the Visual Dialog workshop and challenge (Shrivastava,
Hudson et al. 2020). There's also a chatbot at https://visualchatbot.cloudcv.org where you
can upload your own image and start a conversation, which can sometimes lead to humorous
(or weird) outcomes (Shane 2019).


Vision-language pre-training. As with many other recognition tasks, pre-training has had 
some dramatic success in the last few years, with systems such as ViLBERT (Lu, Batra et
al. 2019), Oscar (Li, Yin et al. 2020), and many other systems described in the CVPR 2020
tutorial on self-supervised learning for vision-and-language (Yu, Chen, and Li 2020).


6.7 Additional reading 


Unlike machine learning or deep learning, there are no recent textbooks or surveys devoted 
specifically to the general topics of image recognition and scene understanding. Some ear- 
lier surveys (Pinz 2005; Andreopoulos and Tsotsos 2013) and collections of papers (Ponce, 
Hebert et al. 2006; Dickinson, Leonardis et al. 2007) review the “classic” (pre-deep learning) 
approaches, but given the tremendous changes in the last decade, many of these techniques 
are no longer used. Currently, some of the best sources for the latest material, in addition 
to this chapter and university computer vision courses, are tutorials at the major vision con- 
ferences such as ICCV (Xie, Girshick et al. 2019), CVPR (Girshick, Kirillov et al. 2020), 
and ECCV (Xie, Girshick et al. 2020). Image recognition datasets such as those listed in 
Tables 6.1-6.4 that maintain active leaderboards can also be a good source for recent papers. 

Algorithms for instance recognition, i.e., the detection of static manufactured objects that 
only vary slightly in appearance but may vary in 3D pose, are still often based on detecting 
2D points of interest and describing them using viewpoint-invariant descriptors, as discussed 
in Chapter 7 and by Lowe (2004), Rothganger, Lazebnik et al. (2006), and Gordon and Lowe
(2006). In more recent years, attention has shifted to the more challenging problem of instance
retrieval (also known as content-based image retrieval), in which the number of images
being searched can be very large (Sivic and Zisserman 2009). Section 7.1.4 in the next
chapter reviews such techniques, as does the survey by Zheng, Yang, and Tian (2018). This
topic is also related to visual similarity search (Bell and Bala 2015; Arandjelovic, Gronat et
al. 2016; Song, Xiang et al. 2016; Gordo, Almazán et al. 2017; Rawat and Wang 2017; Bell,
Liu et al. 2020), which was covered in Section 6.2.3.

30https://visualdialog.org

A number of surveys, collections of papers, and course notes have been written on the 
topic of feature-based whole image (single-object) category recognition (Pinz 2005; Ponce, 
Hebert et al. 2006; Dickinson, Leonardis et al. 2007; Fei-Fei, Fergus, and Torralba 2009). 
Some of these papers use a bag of words or keypoints (Csurka, Dance et al. 2004; Lazebnik, 
Schmid, and Ponce 2006; Csurka, Dance et al. 2006; Grauman and Darrell 2007b; Zhang, 
Marszalek et al. 2007; Boiman, Shechtman, and Irani 2008; Ferencz, Learned-Miller, and 
Malik 2008). Other papers recognize objects based on their contours, e.g., using shape con- 
texts (Belongie, Malik, and Puzicha 2002) or other techniques (Shotton, Blake, and Cipolla 
2005; Opelt, Pinz, and Zisserman 2006; Ferrari, Tuytelaars, and Van Gool 2006a). 

Many object recognition algorithms use part-based decompositions to provide greater in- 
variance to articulation and pose. Early algorithms focused on the relative positions of the 
parts (Fischler and Elschlager 1973; Kanade 1977; Yuille 1991) while later algorithms used 
more sophisticated models of appearance (Felzenszwalb and Huttenlocher 2005; Fergus, Per- 
ona, and Zisserman 2007; Felzenszwalb, McAllester, and Ramanan 2008). Good overviews 
on part-based models for recognition can be found in the course notes by Fergus (2009). 
Carneiro and Lowe (2006) discuss a number of graphical models used for part-based recog- 
nition, which include trees and stars, k-fans, and constellations. 

Classical recognition algorithms often used scene context as part of their recognition strat- 
egy. Representative papers in this area include Torralba (2003), Torralba, Murphy et al. 
(2003), Rabinovich, Vedaldi et al. (2007), Russell, Torralba et al. (2007), Sudderth, Torralba 
et al. (2008), and Divvala, Hoiem et al. (2009). Machine learning also became a key compo- 
nent of classical object detection and recognition algorithms (Felzenszwalb, McAllester, and 
Ramanan 2008; Sivic, Russell et al. 2008), as did exploiting large human-labeled databases 
(Russell, Torralba et al. 2007; Torralba, Freeman, and Fergus 2008). 

The breakthrough success of the “AlexNet” SuperVision system of Krizhevsky, Sutskever, 
and Hinton (2012) shifted the focus in category recognition research from feature-based ap- 
proaches to deep neural networks. The rapid improvement in recognition accuracy, captured 
in Figure 5.40 and described in more detail in Section 5.4.3, has been driven to a large degree
by deeper networks and better training algorithms, and also in part by larger (unlabeled)
training datasets (Section 5.4.7). 

More specialized recognition systems such as those for recognizing faces underwent a 
similar evolution. While some of the earliest approaches to face recognition involved finding
the distinctive image features and measuring the distances between them (Fischler and
Elschlager 1973; Kanade 1977; Yuille 1991), later approaches relied on comparing gray-level
images, often projected onto lower dimensional subspaces (Turk and Pentland 1991;
Belhumeur, Hespanha, and Kriegman 1997; Heisele, Ho et al. 2003) or local binary patterns
(Ahonen, Hadid, and Pietikäinen 2006). A variety of shape and pose deformation models
were also developed (Beymer 1996; Vetter and Poggio 1997), including Active Shape Mod- 
els (Cootes, Cooper et al. 1995), 3D Morphable Models (Blanz and Vetter 1999; Egger, Smith 
et al. 2020), and Active Appearance Models (Cootes, Edwards, and Taylor 2001; Matthews 
and Baker 2004; Ramnath, Koterba et al. 2008). Additional information about classic face
recognition algorithms can be found in a number of surveys and books on this topic (Chel- 
lappa, Wilson, and Sirohey 1995; Zhao, Chellappa et al. 2003; Li and Jain 2005). 


The concept of shape models for frontalization continued to be used as the community 
shifted to deep neural network approaches (Taigman, Yang et al. 2014). Some more recent 
deep face recognizers, however, omit the frontalization stage and instead use data augmen- 
tation to create synthetic inputs with a larger variety of poses (Schroff, Kalenichenko, and 
Philbin 2015; Parkhi, Vedaldi, and Zisserman 2015). Masi, Wu et al. (2018) provide an ex- 
cellent tutorial and survey on deep face recognition, including a list of widely used training 
and testing datasets, a discussion of frontalization and dataset augmentation, and a section on
training losses.


As the problem of whole-image (single object) category recognition became more “solved”, 
attention shifted to multiple object delineation and labeling, i.e., object detection. Object 
detection was originally studied in the context of specific categories such as faces, pedes- 
trians, cars, etc. Seminal papers in face detection include those by Osuna, Freund, and 
Girosi (1997); Sung and Poggio (1998); Rowley, Baluja, and Kanade (1998); Viola and Jones 
(2004); Heisele, Ho et al. (2003), with Yang, Kriegman, and Ahuja (2002) providing a com- 
prehensive survey of early work in this field. Early work in pedestrian and car detection 
was carried out by Gavrila and Philomin (1999); Gavrila (1999); Papageorgiou and Poggio 
(2000); Schneiderman and Kanade (2004). Subsequent papers include (Mikolajczyk, Schmid,
and Zisserman 2004; Dalal and Triggs 2005; Leibe, Seemann, and Schiele 2005; Andriluka, 
Roth, and Schiele 2009; Dollar, Belongie, and Perona 2010; Felzenszwalb, Girshick et al. 
2010). 


Modern generic object detectors are typically constructed using a region proposal algo- 
rithm (Uijlings, Van De Sande et al. 2013; Zitnick and Dollar 2014) that then feeds selected 
regions of the image (either as pixels or precomputed neural features) into a multi-way clas- 
sifier, resulting in architectures such as R-CNN (Girshick, Donahue et al. 2014), Fast R-CNN 
(Girshick 2015), Faster R-CNN (Ren, He et al. 2015), and FPN (Lin, Dollar et al. 2017).
An alternative to this two-stage approach is a single-stage network, which uses a single network
to output detections at a variety of locations. Examples of such architectures include
SSD (Liu, Anguelov et al. 2016), RetinaNet (Lin, Goyal et al. 2017), and YOLO (Redmon,
Divvala et al. 2016; Redmon and Farhadi 2017, 2018; Bochkovskiy, Wang, and Liao 2020). 
These and more recent convolutional object detectors are described in the recent survey by 
Jiao, Zhang et al. (2019). 


While object detection can be sufficient in many computer vision applications such as 
counting cars or pedestrians or even describing images, a detailed pixel-accurate labeling 
can be potentially even more useful, e.g., for photo editing. This kind of labeling comes in 
several flavors, including semantic segmentation (what stuff is this?), instance segmentation 
(which countable object is this?), and panoptic segmentation (what stuff or object is it?). One
early approach to this problem was to pre-segment the image into pieces and then match
these pieces to portions of the model (Mori, Ren et al. 2004; Russell, Efros et al. 2006;
Borenstein and Ullman 2008; Gu, Lim et al. 2009). Another popular approach was to use
conditional random fields (Kumar and Hebert 2006; He, Zemel, and Carreira-Perpiñán 2004;
Winn and Shotton 2006; Rabinovich, Vedaldi et al. 2007; Shotton, Winn et al. 2009), which
at that time produced some of the best results on the PASCAL VOC segmentation challenge. 
Modern semantic segmentation algorithms use pyramidal fully-convolutional architectures to 
map input pixels to class labels (Long, Shelhamer, and Darrell 2015; Zhao, Shi et al. 2017; 
Xiao, Liu et al. 2018; Wang, Sun et al. 2020). 


The more challenging task of instance segmentation, where each distinct object gets its 
own unique label, is usually tackled using a combination of object detectors and per-object 
segmentation, as exemplified in the seminal Mask R-CNN paper by He, Gkioxari et al. 
(2017). Follow-on work uses more sophisticated backbone architectures (Liu, Qi et al. 2018; 
Chen, Pang et al. 2019). Two recent workshops that highlight the latest results in this area 
are the COCO + LVIS Joint Recognition Challenge (Kirillov, Lin et al. 2020) and the Robust 
Vision Challenge (Zendel et al. 2020). 


Putting semantic and instance segmentation together has long been a goal of semantic 
scene understanding (Yao, Fidler, and Urtasun 2012; Tighe and Lazebnik 2013; Tu, Chen et 
al. 2005). Doing this on a per-pixel level results in a panoptic segmentation of the scene, 
where all of the objects are correctly segmented and the remaining stuff is correctly labeled 
(Kirillov, He et al. 2019; Kirillov, Girshick et al. 2019). The COCO dataset has now been 
extended to include a panoptic segmentation task, on which some recent results can be found 
in the ECCV 2020 workshop on this topic (Kirillov, Lin et al. 2020). 

Research in video understanding, or more specifically human activity recognition, dates 
back to the 1990s; some good surveys include (Aggarwal and Cai 1999; Poppe 2010; Aggarwal
and Ryoo 2011; Weinland, Ronfard, and Boyer 2011). In the last decade, video understanding
techniques shifted to using deep networks (Ji, Xu et al. 2013; Karpathy, Toderici et
al. 2014; Simonyan and Zisserman 2014a; Donahue, Hendricks et al. 2015; Tran, Bourdev et
al. 2015; Feichtenhofer, Pinz, and Zisserman 2016; Carreira and Zisserman 2017; Tran, Wang 
et al. 2019; Wu, Feichtenhofer et al. 2019; Feichtenhofer, Fan et al. 2019). Some widely used 
datasets used for evaluating these algorithms are summarized in Table 6.3. 

While associating words with images has been studied for a while (Duygulu, Barnard 
et al. 2002), sustained research into describing images with captions and complete sentences 
started in the early 2010s (Farhadi, Hejrati et al. 2010; Kulkarni, Premraj et al. 2013). The last
decade has seen a rapid increase in performance and capabilities of such systems (Mogadala, 
Kalimuthu, and Klakow 2021; Gan, Yu et al. 2020). The first sub-problem to be widely 
studied was image captioning (Donahue, Hendricks et al. 2015; Fang, Gupta et al. 2015; 
Karpathy and Fei-Fei 2015; Vinyals, Toshev et al. 2015; Xu, Ba et al. 2015; Devlin, Gupta 
et al. 2015), with later systems using attention mechanisms (Anderson, He et al. 2018; Lu, 
Yang et al. 2018). More recently, researchers have developed systems for visual question 
answering (Antol, Agrawal et al. 2015) and visual commonsense reasoning (Zellers, Bisk et 
al. 2019). 

The CVPR 2020 tutorial on recent advances in visual captioning (Zhou 2020) summarizes 
over two dozen related papers from the last five years, including papers that use Transformers 
to do the captioning. It also covers video description and dense video captioning (Aafaq, 
Mian et al. 2019; Zhou, Kalantidis et al. 2019) and vision-language pre-training (Sun, Myers 
et al. 2019; Zhou, Palangi et al. 2020; Li, Yin et al. 2020). The tutorial also has lectures on 
visual question answering and reasoning (Gan 2020), text-to-image synthesis (Cheng 2020), 
and vision-language pre-training (Yu, Chen, and Li 2020). 


6.8 Exercises 


Ex 6.1: Pre-trained recognition networks. Find a pre-trained network for image classifi- 
cation, segmentation, or some other task such as face recognition or pedestrian detection. 
After running the network, can you characterize the most common kinds of errors the 
network is making? Create a “confusion matrix” indicating which categories get classified as 
other categories. Now try the network on your own data, either from a web search or from 
your personal photo collection. Are there surprising results? 
My own favorite code to try is Detectron2,³¹ which I used to generate the panoptic segmentation
results shown in Figure 6.39.
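As a starting point for the confusion matrix requested in this exercise, the following is a minimal sketch in NumPy. The label lists are made-up illustrative data, and the function name `confusion_matrix` is a hypothetical helper rather than part of any particular library; in practice, the true and predicted labels would come from running your chosen network on a labeled validation set.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    counts = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        counts[t, p] += 1
    return counts

# Hypothetical labels for a 3-class problem (illustrative data only).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)

# Off-diagonal entries are the confusions; locate the most frequent one.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
worst_true, worst_pred = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print(cm)
print("most common confusion: true class", worst_true, "predicted as", worst_pred)
```

Normalizing each row by its sum turns the counts into per-class error rates, which can be easier to compare when the classes are unbalanced.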


31Click on the “Colab Notebook” link at https://github.com/facebookresearch/detectron2 and then edit the input
image URL to try your own.
Ex 6.2: Re-training recognition networks. After analyzing the performance of your pre-trained
network, try re-training it on the original dataset on which it was trained, but with
modified parameters (numbers of layers, channels, training parameters) or with additional
examples. Can you get the network to perform more to your liking?

Many of the online tutorials, such as the Detectron2 Colab notebook mentioned above,
come with instructions on how to re-train the network from scratch on a different dataset.
Can you create your own dataset, e.g., using a web search, and figure out how to label the
examples? A low effort (but not very accurate) way is to trust the results of the web search. 
Russakovsky, Deng et al. (2015), Kovashka, Russakovsky et al. (2016), and other papers on 
image datasets discuss the challenges in obtaining accurate labels. 

Train your network, try to optimize its architecture, and report on the challenges you faced
and discoveries you made.
Note: the following exercises were suggested by Matt Deitke. 


Ex 6.3: Image perturbations. Download either ImageNet or Imagenette.³² Now, perturb
each image by adding a small square to the top left of the image, where the color of the square 
is unique for each label, as shown in the following figure: 


[Figure: three perturbed example images: (a) cassette player; (b) golf ball; (c) English Springer]


Using any image classification model, e.g., ResNet, EfficientNet, or ViT, train the model 
from scratch on the perturbed images. Does the model overfit to the color of the square and 
ignore the rest of the image? When evaluating the model on the training and validation data, 
try adversarially swapping colors between different labels. 
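The perturbation described in this exercise could be sketched as follows. This is only one possible implementation: the helper names `make_palette` and `add_label_square`, the 8-pixel square size, and the way the colors are spread over RGB space are all assumptions made for illustration, and images are assumed to be H x W x 3 `uint8` arrays.

```python
import numpy as np

def make_palette(num_classes):
    """One distinct RGB color per label, spread evenly over the 24-bit color codes."""
    step = 256**3 // (num_classes + 1)
    codes = (np.arange(num_classes) + 1) * step          # distinct integers below 2^24
    rgb = np.stack([codes // 65536, (codes // 256) % 256, codes % 256], axis=1)
    return rgb.astype(np.uint8)

def add_label_square(image, label, palette, size=8):
    """Return a copy of the image with a label-colored square in the top-left corner."""
    out = image.copy()
    out[:size, :size] = palette[label]
    return out

num_classes = 10                                         # e.g., the 10 Imagenette classes
palette = make_palette(num_classes)
image = np.zeros((32, 32, 3), dtype=np.uint8)            # stand-in for a real training image
label = 3
perturbed = add_label_square(image, label, palette)

# For the adversarial evaluation, paint the square with a different label's color.
swapped = add_label_square(image, (label + 1) % num_classes, palette)
```

Comparing the accuracy on `perturbed` versus `swapped` versions of the validation images gives a direct measure of how much the trained model relies on the square rather than on the image content.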


Ex 6.4: Image normalization. Using the same dataset downloaded for the previous exer- 
cise, take a ViT model and remove all the intermediate layer normalization operations. Are 
you able to train the network? Using techniques in Li, Xu et al. (2018), how do the plots of
the loss landscape appear with and without the intermediate layer normalization operations?


Ex 6.5: Semantic segmentation. Explain the differences between instance segmentation, 
semantic segmentation, and panoptic segmentation. For each type of segmentation, can it be 
post-processed to obtain the other kinds of segmentation? 


32Imagenette, https://github.com/fastai/imagenette, is a smaller 10-class subset of ImageNet that is easier to use
with limited computing resources.
33You may find the PyTorch Image Models at https://github.com/rwightman/pytorch-image-models useful.
Ex 6.6: Class encoding. Categorical inputs to a neural network, such as a word or object,
can be encoded with a one-hot encoded vector.³⁴ However, it is common to pass the one-hot
encoded vector through an embedding matrix, where the output is then passed into the neural
network loss function. What are the advantages of vector embedding over using one-hot
encoding?
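To make the mechanics of this exercise concrete, the sketch below shows that multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The sizes and the random matrix `E` are illustrative assumptions; in a real network, the entries of the embedding matrix would be learned.

```python
import numpy as np

num_classes, embed_dim = 5, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((num_classes, embed_dim))  # embedding matrix (random stand-in)

label = 2
one_hot = np.zeros(num_classes)
one_hot[label] = 1.0                               # 1 at the chosen label, 0 elsewhere

via_matmul = one_hot @ E                           # one-hot vector times embedding matrix
via_lookup = E[label]                              # same result by indexing row `label`

print(np.allclose(via_matmul, via_lookup))
```

The lookup form avoids materializing the (possibly very long) one-hot vector at all, which is one reason embeddings are usually implemented as table lookups.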


Ex 6.7: Object detection. For object detection, how do the number of parameters for DETR, 
Faster-RCNN, and YOLOv4 compare? Try training each of them on MS COCO. Which one 
tends to train the slowest? How long does it take each model to evaluate a single image at 
inference time? 


Ex 6.8: Image classification vs. description. For image classification, list at least two significant
differences between using categorical labels and natural language descriptions.


Ex 6.9: ImageNet Sketch. Try taking several pre-trained models on ImageNet and evaluating
them, without any fine-tuning, on ImageNet Sketch (Wang, Ge et al. 2019). For each of
these models, to what extent does the performance drop due to the shift in distribution?


Ex 6.10: Self-supervised learning. Provide examples of self-supervised learning pretext
tasks for each of the following data types: static images, videos, and vision-and-language.


Ex 6.11: Video understanding. For many video understanding tasks, we may be interested
in tracking an object through time. Why might this be preferred to making predictions independently
for each frame? Assume that inference speed is not a problem.


Ex 6.12: Fine-tuning a new head. Take the backbone of a network trained for object clas- 
sification and fine-tune it for object detection with a variant of YOLO. Why might it be 
desirable to freeze the early layers of the network? 


Ex 6.13: Movie understanding. Currently, most video understanding networks, such as
those discussed in this chapter, tend to only deal with short video clips as input. What modifications
might be necessary in order to operate over longer sequences such as an entire movie?


34With a categorical variable, one-hot encoding is used to represent which label is chosen, i.e., when a label is
chosen, its entry in the vector is 1 with all other entries being 0.
Figure 7.1 Feature detectors and descriptors can be used to analyze, describe and match
images: (a) point-like interest operators (Brown, Szeliski, and Winder 2005) © 2005 IEEE;
(b) GLOH descriptor (Mikolajczyk and Schmid 2005); (c) edges (Elder and Goldberg 2001)
© 2001 IEEE; (d) straight lines (Sinha, Steedly et al. 2008) © 2008 ACM; (e) graph-based
merging (Felzenszwalb and Huttenlocher 2004) © 2004 Springer; (f) mean shift (Comaniciu
and Meer 2002) © 2002 IEEE.
Feature detection and matching are an essential component of many computer vision applications.
Consider the two pairs of images shown in Figure 7.2. For the first pair, we may
wish to align the two images so that they can be seamlessly stitched into a composite mosaic 
(Section 8.2). For the second pair, we may wish to establish a dense set of correspondences 
so that a 3D model can be constructed or an in-between view can be generated (Chapter 12). 
In either case, what kinds of features should you detect and then match to establish such an 
alignment or set of correspondences? Think about this for a few moments before reading on. 

The first kind of feature that you may notice are specific locations in the images, such as 
mountain peaks, building corners, doorways, or interestingly shaped patches of snow. These 
kinds of localized features are often called keypoint features or interest points (or even cor- 
ners) and are often described by the appearance of pixel patches surrounding the point loca- 
tion (Section 7.1). Another class of important features are edges, e.g., the profile of mountains 
against the sky (Section 7.2). These kinds of features can be matched based on their orienta- 
tion and local appearance (edge profiles) and can also be good indicators of object boundaries 
and occlusion events in image sequences. Edges can be grouped into longer curves and con- 
tours, which can then be tracked (Section 7.3). They can also be grouped into straight line 
segments, which can be directly matched or analyzed to find vanishing points and hence in- 
ternal and external camera parameters (Section 7.4). 

In this chapter, we describe some practical approaches to detecting such features and 
also discuss how feature correspondences can be established across different images. Point 
features are now used in such a wide variety of applications that it is good practice to read 
and implement some of the algorithms from Section 7.1. Edges and lines provide informa- 
tion that is complementary to both keypoint and region-based descriptors and are well suited 
to describing the boundaries of manufactured objects. These alternative descriptors, while 
extremely useful, can be skipped in a short introductory course. 

The last part of this chapter (Section 7.5) discusses bottom-up non-semantic segmentation 
techniques. While these were once widely used as essential components of both recognition 
and matching algorithms, they have mostly been supplanted by the semantic segmentation 
techniques we studied in Section 6.4. They are still used occasionally to group pixels together
for faster or more reliable matching.


7.1 Points and patches 


Point features can be used to find a sparse set of corresponding locations in different images,
often as a precursor to computing camera pose (Chapter 11), which is a prerequisite for
computing a denser set of correspondences using stereo matching (Chapter 12). Such
correspondences can also be used to align different images, e.g., when stitching image mosaics
(Section 8.2) or high dynamic range images (Section 10.2), or performing video stabilization
(Section 9.2.1). They are also used extensively to perform object instance recognition
(Section 6.1). A key advantage of keypoints is that they permit matching even in the presence of
clutter (occlusion) and large scale and orientation changes.

420 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021)

Figure 7.2 Two pairs of images to be matched. What kinds of features might one use to
establish a set of correspondences between these images?


Feature-based correspondence techniques have been used since the early days of stereo 
matching (Hannah 1974; Moravec 1983; Hannah 1988) and subsequently gained popularity 
for image-stitching applications (Zoghlami, Faugeras, and Deriche 1997; Brown and Lowe 
2007) as well as fully automated 3D modeling (Beardsley, Torr, and Zisserman 1996; Schaf- 
falitzky and Zisserman 2002; Brown and Lowe 2005; Snavely, Seitz, and Szeliski 2006). 


There are two main approaches to finding feature points and their correspondences. The 
first is to find features in one image that can be accurately tracked using a local search tech- 
nique, such as correlation or least squares (Section 7.1.5). The second is to independently 
detect features in all the images under consideration and then match features based on their 
local appearance (Section 7.1.3). The former approach is more suitable when images are 
taken from nearby viewpoints or in rapid succession (e.g., video sequences), while the latter
is more suitable when a large amount of motion or appearance change is expected, e.g.,
in stitching together panoramas (Brown and Lowe 2007), establishing correspondences in
wide baseline stereo (Schaffalitzky and Zisserman 2002), or performing object recognition
(Fergus, Perona, and Zisserman 2007).

Figure 7.3 Image pairs with extracted patches below. Notice how some patches can be
localized or matched with higher accuracy than others.


In this section, we split the keypoint detection and matching pipeline into four separate 
stages. During the feature detection (extraction) stage (Section 7.1.1), each image is searched 
for locations that are likely to match well in other images. In the feature description stage 
(Section 7.1.2), each region around detected keypoint locations is converted into a more com- 
pact and stable (invariant) descriptor that can be matched against other descriptors. The 
feature matching stage (Sections 7.1.3 and 7.1.4) efficiently searches for likely matching can- 
didates in other images. The feature tracking stage (Section 7.1.5) is an alternative to the third 
stage that only searches a small neighborhood around each detected feature and is therefore 
more suitable for video processing. 


A wonderful example of all of these stages can be found in David Lowe’s (2004) paper, 
which describes the development and refinement of his Scale Invariant Feature Transform 
(SIFT). Comprehensive descriptions of alternative techniques can be found in a series of sur- 
vey and evaluation papers covering both feature detection (Schmid, Mohr, and Bauckhage 
2000; Mikolajczyk, Tuytelaars et al. 2005; Tuytelaars and Mikolajczyk 2008) and feature
descriptors (Mikolajczyk and Schmid 2005; Balntas, Lenc et al. 2020). Shi and Tomasi (1994)
and Triggs (2004) also provide nice reviews of classic (pre-neural network) feature detection 
techniques. 
Figure 7.4 Aperture problems for different image patches: (a) stable (“corner-like”) flow;
(b) classic aperture problem (barber-pole illusion); (c) textureless region. The two images $I_0$
(yellow) and $I_1$ (red) are overlaid. The red vector $\mathbf{u}$ indicates the displacement between the
patch centers and the $w(\mathbf{x}_i)$ weighting function (patch window) is shown as a dark circle.


7.1.1 Feature detectors 


How can we find image locations where we can reliably find correspondences with other 
images, i.e., what are good features to track (Shi and Tomasi 1994; Triggs 2004)? Look again 
at the image pair shown in Figure 7.3 and at the three sample patches to see how well they 
might be matched or tracked. As you may notice, textureless patches are nearly impossible 
to localize. Patches with large contrast changes (gradients) are easier to localize, although 
straight line segments at a single orientation suffer from the aperture problem (Horn and 
Schunck 1981; Lucas and Kanade 1981; Anandan 1989), i.e., it is only possible to align 
the patches along the direction normal to the edge direction (Figure 7.4b). Patches with 
gradients in at least two (significantly) different orientations are the easiest to localize, as 
shown schematically in Figure 7.4a. 

These intuitions can be formalized by looking at the simplest possible matching criterion
for comparing two image patches, i.e., their (weighted) summed square difference,

$$E_{\mathrm{WSSD}}(\mathbf{u}) = \sum_i w(\mathbf{x}_i)\,[I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2, \tag{7.1}$$

where $I_0$ and $I_1$ are the two images being compared, $\mathbf{u} = (u, v)$ is the displacement vector,
$w(\mathbf{x})$ is a spatially varying weighting (or window) function, and the summation $i$ is over all
the pixels in the patch. Note that this is the same formulation we later use to estimate motion 
between complete images (Section 9.1). 
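The weighted summed square difference of Equation (7.1) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: displacements are restricted to integers, the window is a Gaussian of my own choosing, the helper names `gaussian_window` and `wssd` are hypothetical, and the test image pair is synthetic (a random texture and a shifted copy of it).

```python
import numpy as np

def gaussian_window(size, sigma):
    """2D Gaussian weighting window w(x_i), normalized to sum to 1 (size should be odd)."""
    r = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-0.5 * (r / sigma) ** 2)
    w = np.outer(g, g)
    return w / w.sum()

def wssd(I0, I1, center, u, w):
    """E_WSSD(u) for the patch of I0 centered at `center` = (row, col).

    u = (du, dv) is an integer displacement: du shifts columns (x), dv shifts rows (y).
    Both the patch and its displaced copy are assumed to lie inside the images.
    """
    h = w.shape[0] // 2
    r, c = center
    du, dv = u
    p0 = I0[r - h:r + h + 1, c - h:c + h + 1]
    p1 = I1[r + dv - h:r + dv + h + 1, c + du - h:c + du + h + 1]
    return float(np.sum(w * (p1 - p0) ** 2))

# Synthetic test pair: I1 is I0 moved down by 2 rows and right by 3 columns.
rng = np.random.default_rng(0)
I0 = rng.random((40, 40))
I1 = np.roll(I0, shift=(2, 3), axis=(0, 1))

w = gaussian_window(7, sigma=2.0)
errors = {(du, dv): wssd(I0, I1, (20, 20), (du, dv), w)
          for du in range(-4, 5) for dv in range(-4, 5)}
best_u = min(errors, key=errors.get)   # displacement with the smallest E_WSSD
print(best_u)
```

An exhaustive search like this is only practical for small displacement ranges; it is meant to show what the error surface is, not how to search it efficiently.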

Figure 7.5 Three auto-correlation surfaces $E_{\mathrm{AC}}(\Delta\mathbf{u})$ shown as both grayscale images and
surface plots: (a) The original image is marked with three red crosses to denote where the
auto-correlation surfaces were computed; (b) this patch is from the flower bed (good unique
minimum); (c) this patch is from the roof edge (one-dimensional aperture problem); and (d)
this patch is from the cloud (no good peak). Each grid point in figures b-d is one value of
$\Delta\mathbf{u}$.

When performing feature detection, we do not know which other image locations the
feature will end up being matched against. Therefore, we can only compute how stable this
metric is with respect to small variations in position $\Delta\mathbf{u}$ by comparing an image patch against
itself, which is known as an auto-correlation function or surface

$$E_{\mathrm{AC}}(\Delta\mathbf{u}) = \sum_i w(\mathbf{x}_i)\,[I_0(\mathbf{x}_i + \Delta\mathbf{u}) - I_0(\mathbf{x}_i)]^2 \tag{7.2}$$

(Figure 7.5).¹ Note how the auto-correlation surface for the textured flower bed (Figure 7.5b
and the red cross in the lower right quadrant of Figure 7.5a) exhibits a strong minimum,
indicating that it can be well localized. The correlation surface corresponding to the roof
edge (Figure 7.5c) has a strong ambiguity along one direction, while the correlation surface
corresponding to the cloud region (Figure 7.5d) has no stable minimum.
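The three qualitative behaviors described above (unique minimum, one-dimensional ambiguity, no stable minimum) can be reproduced on synthetic images. The sketch below evaluates the auto-correlation surface of Equation (7.2) at integer shifts; the uniform window, the function name `autocorrelation_surface`, and the three test images are assumptions chosen for brevity rather than anything taken from the figure.

```python
import numpy as np

def autocorrelation_surface(I0, center, w, max_shift=3):
    """E_AC at integer shifts in [-max_shift, max_shift]^2, indexed as E[dv + m, du + m]."""
    h = w.shape[0] // 2
    r, c = center
    m = max_shift
    p0 = I0[r - h:r + h + 1, c - h:c + h + 1]
    E = np.zeros((2 * m + 1, 2 * m + 1))
    for dv in range(-m, m + 1):          # vertical shift (rows)
        for du in range(-m, m + 1):      # horizontal shift (columns)
            p = I0[r + dv - h:r + dv + h + 1, c + du - h:c + du + h + 1]
            E[dv + m, du + m] = np.sum(w * (p - p0) ** 2)
    return E

w = np.ones((7, 7)) / 49.0               # uniform window, used here only for simplicity
rng = np.random.default_rng(0)

texture = rng.random((40, 40))           # random texture, stands in for the flower bed
edge = np.zeros((40, 40))
edge[:, 20:] = 1.0                       # vertical step edge, stands in for the roof edge
flat = np.full((40, 40), 0.5)            # constant region, stands in for the cloud

E_texture = autocorrelation_surface(texture, (20, 20), w)
E_edge = autocorrelation_surface(edge, (20, 20), w)
E_flat = autocorrelation_surface(flat, (20, 20), w)
```

With these inputs, `E_texture` is zero only at the zero shift, `E_edge` is zero along the whole column of purely vertical shifts (sliding along the edge changes nothing), and `E_flat` is zero everywhere.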
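Using a Taylor Series expansion of the image function $I_0(\mathbf{x}_i + \Delta\mathbf{u}) \approx I_0(\mathbf{x}_i) + \nabla I_0(\mathbf{x}_i) \cdot \Delta\mathbf{u}$
(Lucas and Kanade 1981; Shi and Tomasi 1994), we can approximate the auto-correlation
surface as

$$\begin{aligned}
E_{\mathrm{AC}}(\Delta\mathbf{u}) &= \sum_i w(\mathbf{x}_i)\,[I_0(\mathbf{x}_i + \Delta\mathbf{u}) - I_0(\mathbf{x}_i)]^2 && (7.3)\\
&\approx \sum_i w(\mathbf{x}_i)\,[I_0(\mathbf{x}_i) + \nabla I_0(\mathbf{x}_i) \cdot \Delta\mathbf{u} - I_0(\mathbf{x}_i)]^2 && (7.4)\\
&= \sum_i w(\mathbf{x}_i)\,[\nabla I_0(\mathbf{x}_i) \cdot \Delta\mathbf{u}]^2 && (7.5)\\
&= \Delta\mathbf{u}^T \mathbf{A}\, \Delta\mathbf{u}, && (7.6)
\end{aligned}$$

where

$$\nabla I_0(\mathbf{x}_i) = \left(\frac{\partial I_0}{\partial x}, \frac{\partial I_0}{\partial y}\right)(\mathbf{x}_i) \tag{7.7}$$

is the image gradient at $\mathbf{x}_i$. This gradient can be computed using a variety of techniques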
(Schmid, Mohr, and Bauckhage 2000). The classic “Harris” detector (Harris and Stephens 
1988) uses a $[-2\ {-1}\ 0\ 1\ 2]$ filter, but more modern variants (Schmid, Mohr, and Bauckhage
2000; Triggs 2004) convolve the image with horizontal and vertical derivatives of a Gaussian
(typically with $\sigma = 1$).


The auto-correlation matrix $\mathbf{A}$ can be written as

$$\mathbf{A} = w * \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}, \tag{7.8}$$

where we have replaced the weighted summations with discrete convolutions with the weighting
kernel $w$. This matrix can be interpreted as a tensor (multiband) image, where the outer
products of the gradients $\nabla I$ are convolved with a weighting function $w$ to provide a per-pixel
estimate of the local (quadratic) shape of the auto-correlation function.
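The per-pixel auto-correlation matrix of Equation (7.8) might be computed along the following lines. This is a sketch that uses only NumPy and makes several assumptions: gradients are taken with derivative-of-Gaussian filters at scale `sigma_d`, the weighting window $w$ is a Gaussian at scale `sigma_i` (the default of 2 is my own choice), kernels are truncated at about three standard deviations, and borders are zero padded. All function and parameter names are hypothetical.

```python
import numpy as np

def gaussian_kernels(sigma):
    """1D Gaussian g (sums to 1) and its derivative dg, sampled out to about 3 sigma."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-0.5 * (x / sigma) ** 2)
    g /= g.sum()
    dg = -x / sigma**2 * g
    return g, dg

def separable_filter(image, k_rows, k_cols):
    """Convolve every column with k_rows, then every row with k_cols (same size, zero padding)."""
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k_rows, mode="same"), 0, image)
    return np.apply_along_axis(lambda v: np.convolve(v, k_cols, mode="same"), 1, tmp)

def autocorrelation_matrix(image, sigma_d=1.0, sigma_i=2.0):
    """Per-pixel entries (Axx, Axy, Ayy) of A = w * [[Ix^2, Ix Iy], [Ix Iy, Iy^2]]."""
    g, dg = gaussian_kernels(sigma_d)
    Ix = separable_filter(image, g, dg)   # derivative along x (columns), smoothing along y
    Iy = separable_filter(image, dg, g)   # derivative along y (rows), smoothing along x
    w, _ = gaussian_kernels(sigma_i)      # separable Gaussian weighting window
    Axx = separable_filter(Ix * Ix, w, w)
    Axy = separable_filter(Ix * Iy, w, w)
    Ayy = separable_filter(Iy * Iy, w, w)
    return Axx, Axy, Ayy

# Sanity check on a horizontal ramp I(x, y) = x: the gradient is (1, 0) away from the borders,
# so A should be close to [[1, 0], [0, 0]] there.
ramp = np.tile(np.arange(32, dtype=float), (32, 1))
Axx, Axy, Ayy = autocorrelation_matrix(ramp)
print(Axx[16, 16], Axy[16, 16], Ayy[16, 16])
```

Only three images need to be stored because the matrix is symmetric. The zero padding corrupts values within a few pixels of the image border, so detections there should be discarded or the image padded differently.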


1Strictly speaking, a correlation is the product of two patches (3.12); I’m using the term here in a more qualitative
sense. The weighted sum of squared differences is often called an SSD surface (Section 9.1).
Figure 7.6 Uncertainty ellipse corresponding to an eigenvalue analysis of the auto-correlation
matrix $\mathbf{A}$.


As first shown by Anandan (1984; 1989) and further discussed in Section 9.1.3 and Equation
(9.37), the inverse of the matrix $\mathbf{A}$ provides a lower bound on the uncertainty in the
location of a matching patch. It is therefore a useful indicator of which patches can be reliably
matched. The easiest way to visualize and reason about this uncertainty is to perform
an eigenvalue analysis of the auto-correlation matrix $\mathbf{A}$, which produces two eigenvalues
$(\lambda_0, \lambda_1)$ and two eigenvector directions (Figure 7.6). Since the larger uncertainty depends on
the smaller eigenvalue, i.e., $\lambda_0^{-1/2}$, it makes sense to find maxima in the smaller eigenvalue
to locate good features to track (Shi and Tomasi 1994).


Förstner–Harris. While Anandan (1984) and Lucas and Kanade (1981) were the first to
analyze the uncertainty structure of the auto-correlation matrix, they did so in the context
of associating certainties with optical flow measurements. Förstner (1986) and Harris and
Stephens (1988) were the first to propose using local maxima in rotationally invariant scalar
measures derived from the auto-correlation matrix to locate keypoints for the purpose of
sparse feature matching.² Both of these techniques also proposed using a Gaussian weighting
window instead of the previously used square patches, which makes the detector response
insensitive to in-plane image rotations.

The minimum eigenvalue $\lambda_0$ (Shi and Tomasi 1994) is not the only quantity that can be
used to find keypoints. A simpler quantity, proposed by Harris and Stephens (1988), is

$\det(\mathbf{A}) - \alpha\,\mathrm{trace}(\mathbf{A})^2 = \lambda_0 \lambda_1 - \alpha(\lambda_0 + \lambda_1)^2$ (7.9)

with $\alpha = 0.06$. Unlike eigenvalue analysis, this quantity does not require the use of square
roots and yet is still rotationally invariant and also downweights edge-like features where


²Schmid, Mohr, and Bauckhage (2000) and Triggs (2004) give more detailed historical reviews of feature detection
algorithms.
Figure 7.7 Isocontours of popular keypoint detection functions (Brown, Szeliski, and
Winder 2004). Each detector looks for points where the eigenvalues $\lambda_0, \lambda_1$ of
$\mathbf{A} = w * \nabla I \nabla I^T$ are both large.


$\lambda_1 \gg \lambda_0$. Triggs (2004) suggests using the quantity

$\lambda_0 - \alpha \lambda_1$ (7.10)

(say, with $\alpha = 0.05$), which also reduces the response at 1D edges, where aliasing errors
sometimes inflate the smaller eigenvalue. He also shows how the basic 2 × 2 Hessian can be
extended to parametric motions to detect points that are also accurately localizable in scale
and rotation. Brown, Szeliski, and Winder (2005), on the other hand, use the harmonic mean,

$\frac{\det \mathbf{A}}{\mathrm{tr}\,\mathbf{A}} = \frac{\lambda_0 \lambda_1}{\lambda_0 + \lambda_1},$ (7.11)

which is a smoother function in the region where $\lambda_0 \approx \lambda_1$. Figure 7.7 shows isocontours
of the various interest point operators, from which we can see how the two eigenvalues are 
blended to determine the final interest value. Figure 7.8 shows the resulting interest operator 
responses for the classic Harris detector as well as the difference of Gaussian (DoG) detector 
discussed below. 
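
To make the relationship between these interest measures concrete, the following minimal sketch evaluates the minimum eigenvalue, the Harris measure (7.9), the Triggs measure (7.10), and the harmonic mean (7.11) for a single 2 × 2 auto-correlation matrix. The function names and the example matrices are hypothetical choices made for illustration, and the code uses only the Python standard library; it is not taken from any of the cited implementations.

```python
import math

def eigenvalues_2x2(a, b, c):
    """Eigenvalues (lam0 <= lam1) of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    root = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return mean - root, mean + root

def interest_measures(a, b, c, alpha_harris=0.06, alpha_triggs=0.05):
    """Scalar interest measures for A = [[a, b], [b, c]] (illustrative helper)."""
    lam0, lam1 = eigenvalues_2x2(a, b, c)
    det = a * c - b * b          # equals lam0 * lam1
    trace = a + c                # equals lam0 + lam1
    return {
        "shi_tomasi": lam0,                                # minimum eigenvalue
        "harris": det - alpha_harris * trace ** 2,         # Equation (7.9)
        "triggs": lam0 - alpha_triggs * lam1,              # Equation (7.10)
        "harmonic_mean": det / trace if trace > 0 else 0.0,  # Equation (7.11)
    }

# A corner-like patch (both eigenvalues large) and an edge-like patch
# (one eigenvalue much larger than the other); values are made up.
corner = interest_measures(10.0, 0.0, 8.0)
edge = interest_measures(10.0, 0.0, 0.1)
```

In this example, every measure scores the corner-like matrix above the edge-like one, and the Harris and Triggs measures become negative for the edge, which is the downweighting behavior described above.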


Adaptive non-maximal suppression (ANMS). While most feature detectors simply look 
for local maxima in the interest function, this can lead to an uneven distribution of feature 
points across the image, e.g., points will be denser in regions of higher contrast. To mitigate 
this problem, Brown, Szeliski, and Winder (2005) only detect features that are both local 
maxima and whose response value is significantly (10%) greater than that of all of its neighbors
within a radius r (Figure 7.9c–d). They devise an efficient way to associate suppression
radii with all local maxima by first sorting them by their response strength and then creating
Figure 7.8 Interest operator responses: (a) Sample image, (b) Harris response, and (c)
DoG response. The circle sizes and colors indicate the scale at which each interest point was
detected. Notice how the two detectors tend to respond at complementary locations.
(a) Strongest 250 (b) Strongest 500 (c) ANMS 250, r = 24 (d) ANMS 500, r = 16

Figure 7.9 Adaptive non-maximal suppression (ANMS) (Brown, Szeliski, and Winder
2005) © 2005 IEEE: The upper two images show the strongest 250 and 500 interest points,
while the lower two images show the interest points selected with adaptive non-maximal
suppression, along with the corresponding suppression radius r. Note how the latter features
have a much more uniform spatial distribution across the image.
Figure 7.10 Multi-scale oriented patches (MOPS) extracted at five pyramid levels (Brown,
Szeliski, and Winder 2005) © 2005 IEEE. The boxes show the feature orientation and the
region from which the descriptor vectors are sampled.


a second list sorted by decreasing suppression radius (Brown, Szeliski, and Winder 2005). 
Figure 7.9 shows a qualitative comparison of selecting the top n features and using ANMS. 
Note that non-maximal suppression is now also an essential component of DNN-based object 
detectors, as discussed in Section 6.3.3. 
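
The ANMS selection rule can be sketched in a few lines. The version below is a simple quadratic-time illustration rather than the efficient sorted-list procedure of Brown, Szeliski, and Winder (2005): each point's suppression radius is the distance to the nearest point that is significantly stronger, and the points with the largest radii are kept. The function name `anms`, the `(x, y, strength)` tuple layout, and the robustness factor `c_robust = 0.9` (used here to express the roughly 10% margin) are assumptions made for this example.

```python
import math

def anms(points, num_keep, c_robust=0.9):
    """Keep the num_keep points with the largest suppression radii.

    points: list of (x, y, strength) tuples with non-negative strengths.
    A point i is suppressed by point j when strength_i < c_robust * strength_j.
    """
    # Sort by decreasing strength so that only earlier points can suppress later ones.
    order = sorted(points, key=lambda p: p[2], reverse=True)
    radii = []
    for i, (xi, yi, fi) in enumerate(order):
        radius = math.inf  # the globally strongest point is never suppressed
        for xj, yj, fj in order[:i]:
            if fi < c_robust * fj:
                radius = min(radius, math.hypot(xi - xj, yi - yj))
        radii.append(radius)
    # Rank points by decreasing suppression radius and keep the first num_keep.
    ranked = sorted(range(len(order)), key=lambda k: radii[k], reverse=True)
    return [order[k] for k in ranked[:num_keep]]

# Three strong points clustered together plus two weaker, isolated points.
candidates = [(0, 0, 10.0), (1, 0, 8.0), (2, 0, 7.0), (50, 50, 5.0), (100, 0, 4.0)]
strongest = sorted(candidates, key=lambda p: p[2], reverse=True)[:3]
spread_out = anms(candidates, 3)
```

Here `strongest` contains only the three clustered points, whereas `spread_out` keeps the strongest point together with the two isolated ones, mirroring the more uniform distribution seen in Figure 7.9.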


Measuring repeatability. Given the large number of feature detectors that have been developed
in computer vision, how can we decide which ones to use? Schmid, Mohr, and
Bauckhage (2000) were the first to propose measuring the repeatability of feature detectors,
which they define as the frequency with which keypoints detected in one image are found
within $\epsilon$ (say, $\epsilon = 1.5$) pixels of the corresponding location in a transformed image. In their
paper, they transform their planar images by applying rotations, scale changes, illumination
changes, viewpoint changes, and adding noise. They also measure the information content
available at each detected feature point, which they define as the entropy of a set of rotationally
invariant local grayscale descriptors. Among the techniques they survey, they find that
the improved (Gaussian derivative) version of the Harris operator with $\sigma_d = 1$ (scale of the
derivative Gaussian) and $\sigma_i = 2$ (scale of the integration Gaussian) works best.
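
As a rough illustration of this definition, the sketch below computes a repeatability score for two keypoint lists related by a known 3 × 3 homography. It is a simplified reading of the measure: it assumes that every keypoint in the first image is visible in the second, so it omits the restriction to the common image region used in careful evaluations. The function names and the toy data are hypothetical.

```python
import math

def apply_homography(H, x, y):
    """Map the point (x, y) through the 3x3 homography H (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def repeatability(kps_a, kps_b, H, eps=1.5):
    """Fraction of keypoints in image A that reappear within eps pixels in image B."""
    if not kps_a:
        return 0.0
    hits = 0
    for x, y in kps_a:
        u, v = apply_homography(H, x, y)
        if any(math.hypot(u - xb, v - yb) <= eps for xb, yb in kps_b):
            hits += 1
    return hits / len(kps_a)

# Toy example: image B is image A shifted 5 pixels to the right.
H_shift = [[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kps_a = [(0, 0), (10, 10), (20, 5), (30, 30)]
kps_b = [(5, 0.5), (15, 10), (25, 8), (100, 100)]
score = repeatability(kps_a, kps_b, H_shift)
```

Two of the four keypoints land within 1.5 pixels of a detection in the second image, so `score` is 0.5 for this toy data.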


Scale invariance 


In many situations, detecting features at the finest stable scale possible may not be appropriate.
For example, when matching images with little high-frequency detail (e.g., clouds),
fine-scale features may not exist.

One solution to the problem is to extract features at a variety of scales, e.g., by performing 
the same operations at multiple resolutions in a pyramid and then matching features at the 
same level. This kind of approach is suitable when the images being matched do not undergo 
large scale changes, e.g., when matching successive aerial images taken from an airplane or 
stitching panoramas taken with a fixed-focal-length camera. Figure 7.10 shows the output of 
one such approach: the multi-scale oriented patch detector of Brown, Szeliski, and Winder 
(2005), for which responses at five different scales are shown. 

However, for most object recognition applications, the scale of the object in the image 
is unknown. Instead of extracting features at many different scales and then matching all of 
them, it is more efficient to extract features that are stable in both location and scale (Lowe 
2004; Mikolajczyk and Schmid 2004).

Early investigations into scale selection were performed by Lindeberg (1993; 1998b), 
who first proposed using extrema in the Laplacian of Gaussian (LoG) function as interest 
point locations. Based on this work, Lowe (2004) proposed computing a set of sub-octave 
Difference of Gaussian filters (Figure 7.11a), looking for 3D (space+scale) maxima in the re- 
sulting structure (Figure 7.11b), and then computing a sub-pixel space+scale location using a 
quadratic fit (Brown and Lowe 2002). The number of sub-octave levels was determined, after
careful empirical investigation, to be three, which corresponds to a quarter-octave pyramid,
the same as that used by Triggs (2004).

As with the Harris operator, pixels where there is strong asymmetry in the local curvature
of the indicator function (in this case, the DoG) are rejected. This is implemented by first
computing the local Hessian of the difference image D,

$\mathbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix},$ (7.12)

and then rejecting keypoints for which

$\frac{\mathrm{Tr}(\mathbf{H})^2}{\mathrm{Det}(\mathbf{H})} > 10.$ (7.13)
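
The rejection test in (7.12) and (7.13) can be sketched with finite differences on a small difference-of-Gaussian image stored as nested lists. The central-difference stencils, the function names, and the choice to also reject points with a non-positive determinant (where the two principal curvatures have opposite signs or one vanishes) are assumptions of this example rather than details given in the text.

```python
def hessian_2x2(D, r, c):
    """Finite-difference Hessian entries (Dxx, Dxy, Dyy) of image D at row r, column c."""
    dxx = D[r][c + 1] - 2.0 * D[r][c] + D[r][c - 1]
    dyy = D[r + 1][c] - 2.0 * D[r][c] + D[r - 1][c]
    dxy = 0.25 * (D[r + 1][c + 1] - D[r + 1][c - 1]
                  - D[r - 1][c + 1] + D[r - 1][c - 1])
    return dxx, dxy, dyy

def is_edge_like(D, r, c, max_ratio=10.0):
    """Return True when Tr(H)^2 / Det(H) exceeds max_ratio, as in Equation (7.13)."""
    dxx, dxy, dyy = hessian_2x2(D, r, c)
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0.0:
        return True  # assumption: saddle-like or degenerate curvature is rejected
    return trace * trace / det > max_ratio

# A symmetric blob (equal curvature in x and y) and a ridge (curved only in x).
blob = [[-2.0, -1.0, -2.0],
        [-1.0,  0.0, -1.0],
        [-2.0, -1.0, -2.0]]
ridge = [[-1.01, -0.01, -1.01],
         [-1.00,  0.00, -1.00],
         [-1.01, -0.01, -1.01]]
```

For the blob, the ratio is 4, so the point is kept; for the ridge, the ratio is roughly 100, so the point is rejected as edge-like.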


While Lowe’s Scale Invariant Feature Transform (SIFT) performs well in practice, it is not
based on the same theoretical foundation of maximum spatial stability as the auto-correlation-based
detectors. (In fact, its detection locations are often complementary to those produced
by such techniques and can therefore be used in conjunction with these other approaches.)
In order to add a scale selection mechanism to the Harris corner detector, Mikolajczyk and
Schmid (2004) evaluate the Laplacian of Gaussian function at each detected Harris point (in
a multi-scale pyramid) and keep only those points for which the Laplacian is extremal (larger
Figure 7.11 Scale-space feature detection using a sub-octave Difference of Gaussian pyramid
(Lowe 2004) © 2004 Springer: (a) Adjacent levels of a sub-octave Gaussian pyramid are
subtracted to produce Difference of Gaussian images; (b) extrema (maxima and minima) in
the resulting 3D volume are detected by comparing a pixel to its 26 neighbors.


or smaller than both its coarser and finer-level values). An optional iterative refinement for 
both scale and position is also proposed and evaluated. Additional examples of scale-invariant 
region detectors are discussed by Mikolajczyk, Tuytelaars et al. (2005) and Tuytelaars and 
Mikolajczyk (2008). 


Rotational invariance and orientation estimation 


In addition to dealing with scale changes, most image matching and object recognition algo- 
rithms need to deal with (at least) in-plane image rotation. One way to deal with this problem 
is to design descriptors that are rotationally invariant (Schmid and Mohr 1997), but such 
descriptors have poor discriminability, i.e., they map different-looking patches to the same
descriptor.

A better method is to estimate a dominant orientation at each detected keypoint. Once 
the local orientation and scale of a keypoint have been estimated, a scaled and oriented patch 
around the detected point can be extracted and used to form a feature descriptor (Figures 7.10 
and 7.15). 

The simplest possible orientation estimate is the average gradient within a region around 
the keypoint. If a Gaussian weighting function is used (Brown, Szeliski, and Winder 2005), 
Figure 7.12 A dominant orientation estimate can be computed by creating a histogram of
all the gradient orientations (weighted by their magnitudes or after thresholding out small
gradients) and then finding the significant peaks in this distribution (Lowe 2004) © 2004
Springer. (Panels: image gradients; angle histogram over 0 to 2π.)


this average gradient is equivalent to a first-order steerable filter (Section 3.2.3), i.e., it can be
computed using an image convolution with the horizontal and vertical derivatives of a Gaussian
filter (Freeman and Adelson 1991). To make this estimate more reliable, it is usually
preferable to use a larger aggregation window (Gaussian kernel size) than detection window 
(Brown, Szeliski, and Winder 2005). The orientations of the square boxes shown in Fig- 
ure 7.10 were computed using this technique. 

Sometimes, however, the averaged (signed) gradient in a region can be small and therefore 
an unreliable indicator of orientation. A more reliable technique is to look at the histogram 
of orientations computed around the keypoint. Lowe (2004) computes a 36-bin histogram 
of edge orientations weighted by both gradient magnitude and Gaussian distance to the cen- 
ter, finds all peaks within 80% of the global maximum, and then computes a more accurate 


orientation estimate using a three-bin parabolic fit (Figure 7.12). 
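
The histogram-based estimate can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a list of `(gx, gy, weight)` samples in which `weight` stands in for the Gaussian distance weighting, samples are assigned to the nearest bin without the histogram smoothing used in practical implementations, and the function name is hypothetical.

```python
import math

def dominant_orientations(gradients, num_bins=36, peak_ratio=0.8):
    """Return orientation estimates (radians in [0, 2*pi)) from weighted gradients.

    gradients: iterable of (gx, gy, weight) samples around the keypoint.
    """
    hist = [0.0] * num_bins
    bin_width = 2.0 * math.pi / num_bins
    for gx, gy, weight in gradients:
        magnitude = math.hypot(gx, gy)
        angle = math.atan2(gy, gx) % (2.0 * math.pi)
        hist[int(angle / bin_width) % num_bins] += weight * magnitude
    peak = max(hist)
    if peak <= 0.0:
        return []
    estimates = []
    for k in range(num_bins):
        left, mid, right = hist[k - 1], hist[k], hist[(k + 1) % num_bins]
        # Keep local maxima whose height is within peak_ratio of the global maximum.
        if mid >= peak_ratio * peak and mid > left and mid >= right:
            # Three-bin parabolic fit: sub-bin offset of the parabola's vertex.
            denom = left - 2.0 * mid + right
            offset = 0.5 * (left - right) / denom if denom != 0.0 else 0.0
            estimates.append(((k + 0.5 + offset) * bin_width) % (2.0 * math.pi))
    return estimates

def unit_gradient(degrees):
    """Helper for the example: a unit gradient pointing in the given direction."""
    theta = math.radians(degrees)
    return (math.cos(theta), math.sin(theta), 1.0)

single = dominant_orientations([(1.0, 1.0, 1.0)] * 5)
multiple = dominant_orientations(
    [unit_gradient(45)] * 10 + [unit_gradient(205)] * 9 + [unit_gradient(305)] * 2)
```

In the second call, the 205° mode reaches 90% of the strongest mode and is therefore reported as a second orientation, while the weak 305° mode is discarded.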


Affine invariance 


While scale and rotation invariance are highly desirable, for many applications such as wide 
baseline stereo matching (Pritchett and Zisserman 1998; Schaffalitzky and Zisserman 2002) 
or location recognition (Chum, Philbin et al. 2007), full affine invariance is preferred. Affine- 
invariant detectors not only respond at consistent locations after scale and orientation changes, 
they also respond consistently across affine deformations such as (local) perspective fore- 
shortening (Figure 7.13). In fact, for a small enough patch, any continuous image warping 
can be well approximated by an affine deformation. 


To introduce affine invariance, several authors have proposed fitting an ellipse to the auto- 
Figure 7.13 Affine region detectors used to match two images taken from dramatically
different viewpoints (Mikolajczyk and Schmid 2004) © 2004 Springer.


Figure 7.14 Maximally stable extremal regions (MSERs) extracted and matched from a
number of images (Matas, Chum et al. 2004) © 2004 Elsevier.


correlation or Hessian matrix (using eigenvalue analysis) and then using the principal axes
and ratios of this fit as the affine coordinate frame (Lindeberg and Gårding 1997; Baumberg
2000; Mikolajczyk and Schmid 2004; Mikolajczyk, Tuytelaars et al. 2005; Tuytelaars and
Mikolajczyk 2008).

Another important affine invariant region detector is the maximally stable extremal region 
(MSER) detector developed by Matas, Chum et al. (2004). To detect MSERs, binary regions 
are computed by thresholding the image at all possible gray levels (the technique therefore 
only works for grayscale images). This operation can be performed efficiently by first sorting 
all pixels by gray value and then incrementally adding pixels to each connected component 
as the threshold is changed (Nistér and Stewénius 2008). As the threshold is changed, the 
area of each component (region) is monitored; regions whose rate of change of area with 
respect to the threshold is minimal are defined as maximally stable. Such regions are therefore 
invariant to both affine geometric and photometric (linear bias-gain or smooth monotonic) 
transformations (Figure 7.14). If desired, an affine coordinate frame can be fit to each detected 
region using its moment matrix. 

The area of feature point detection continues to be very active, with papers appearing
every year at major computer vision conferences. Mikolajczyk, Tuytelaars et al. (2005) and
Tuytelaars and Mikolajczyk (2008) survey a number of popular (pre-DNN) affine region detectors
and provide experimental comparisons of their invariance to common image transformations
such as scaling, rotations, noise, and blur.


More recent papers published in the last decade include: 


SURF (Bay, Ess et al. 2008), which uses integral images for faster convolutions;

FAST and FASTER (Rosten, Porter, and Drummond 2010), one of the first learned detectors;

BRISK (Leutenegger, Chli, and Siegwart 2011), which uses a scale-space FAST detector
together with a bit-string descriptor;

ORB (Rublee, Rabaud et al. 2011), which adds orientation to FAST; and

KAZE (Alcantarilla, Bartoli, and Davison 2012) and Accelerated-KAZE (Alcantarilla,
Nuevo, and Bartoli 2013), which use non-linear diffusion to select the scale for feature
detection.


While FAST introduced the idea of machine learning for feature detectors, more recent
papers use convolutional neural networks to perform the detection. These include:


Learning covariant feature detectors (Lenc and Vedaldi 2016); 
Learning to assign orientations to feature points (Yi, Verdie et al. 2016); 


LIFT, learned invariant feature transforms (Yi, Trulls et al. 2016), SuperPoint, self- 
supervised interest point detection and description (DeTone, Malisiewicz, and Rabi- 
novich 2018), and LF-Net, learning local features from images (Ono, Trulls et al. 
2018), all three of which jointly optimize the detectors and descriptors in a single 
(multi-head) pipeline; 


AffNet (Mishkin, Radenovic, and Matas 2018), which detects matchable affine-covariant
regions;

Key.Net (Barroso-Laguna, Riba et al. 2019), which uses a combination of handcrafted
and learned CNN features; and

D2-Net (Dusmanu, Rocco et al. 2019), R2D2 (Revaud, Weinzaepfel et al. 2019), and
D2D (Tian, Balntas et al. 2020), which all extract dense local feature descriptors and
then keep the ones that have high saliency or repeatability.


These last two papers also contain a nice review of other recent feature detectors, as does the
paper by Balntas, Lenc et al. (2020).
Of course, keypoints are not the only features that can be used for registering images.
Zoghlami, Faugeras, and Deriche (1997) use line segments as well as point-like features to 
estimate homographies between pairs of images, whereas Bartoli, Coquerelle, and Sturm 
(2004) use line segments with local correspondences along the edges to extract 3D structure 
and motion. Tuytelaars and Van Gool (2004) use affine invariant regions to detect corre- 
spondences for wide baseline stereo matching, whereas Kadir, Zisserman, and Brady (2004) 
detect salient regions where patch entropy and its rate of change with scale are locally max- 
imal. Corso and Hager (2005) use a related technique to fit 2D oriented Gaussian kernels 
to homogeneous regions. More details on techniques for finding and matching curves, lines,
and regions can be found later in this chapter.


7.1.2 Feature descriptors 


After detecting keypoint features, we must match them, i.e., we must determine which fea- 
tures come from corresponding locations in different images. In some situations, e.g., for 
video sequences (Shi and Tomasi 1994) or for stereo pairs that have been rectified (Zhang, 
Deriche et al. 1995; Loop and Zhang 1999; Scharstein and Szeliski 2002), the local motion 
around each feature point may be mostly translational. In this case, simple error metrics, such 
as the sum of squared differences or normalized cross-correlation, described in Section 9.1, 
can be used to directly compare the intensities in small patches around each feature point. 
(The comparative study by Mikolajczyk and Schmid (2005), discussed below, uses cross- 
correlation.) Because feature points may not be exactly located, a more accurate matching 
score can be computed by performing incremental motion refinement as described in Sec- 
tion 9.1.3, but this can be time-consuming and can sometimes even decrease performance 
(Brown, Szeliski, and Winder 2005). 

In most cases, however, the local appearance of features will change in orientation and 
scale, and sometimes even undergo affine deformations. Extracting a local scale, orientation, 
or affine frame estimate and then using this to resample the patch before forming the feature 
descriptor is thus usually preferable (Figure 7.15). 

Even after compensating for these changes, the local appearance of image patches will 
usually still vary from image to image. How can we make image descriptors more invariant to 
such changes, while still preserving discriminability between different (non-corresponding) 
patches? Mikolajczyk and Schmid (2005) review a number of view-invariant local image
descriptors and experimentally compare their performance. More recently, Balntas, Lenc
et al. (2020) and Jin, Mishkin et al. (2021) compare the large number of learned feature
descriptors developed in the prior decade.³ Below, we describe a few of these descriptors in


³Many recent publications such as Tian, Yu et al. (2019) use their HPatches dataset to compare their performance
Figure 7.15 Once a local scale and orientation estimate has been determined, MOPS
descriptors are formed using an 8 × 8 sampling of bias and gain normalized intensity values,
with a sample spacing of five pixels relative to the detection scale (Brown, Szeliski, and
Winder 2005) © 2005 IEEE. This low frequency sampling gives the features some robustness
to interest point location error and is achieved by sampling at a higher pyramid level than
the detection scale.


more detail. 


Bias and gain normalization (MOPS). For tasks that do not exhibit large amounts of fore- 
shortening, such as image stitching, simple normalized intensity patches perform reasonably 
well and are simple to implement (Brown, Szeliski, and Winder 2005) (Figure 7.15). To com- 
pensate for slight inaccuracies in the feature point detector (location, orientation, and scale), 
multi-scale oriented patches (MOPS) are sampled at a spacing of five pixels relative to the 
detection scale, using a coarser level of the image pyramid to avoid aliasing. To compensate
for affine photometric variations (linear exposure changes or bias and gain, (3.3)), patch
intensities are re-scaled so that their mean is zero and their variance is one.
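
The bias and gain normalization step is simple enough to show directly. The sketch below operates on a flattened list of sampled intensities (for MOPS this would be the 64 values of the 8 × 8 patch); the function name and the handling of constant patches are assumptions made for this example.

```python
import math

def normalize_bias_gain(patch, eps=1e-12):
    """Rescale sampled intensities to zero mean and unit variance."""
    n = len(patch)
    mean = sum(patch) / n
    variance = sum((v - mean) ** 2 for v in patch) / n
    std = math.sqrt(variance)
    if std < eps:
        return [0.0] * n  # assumption: a constant patch carries no usable signal
    return [(v - mean) / std for v in patch]

# The same patch seen under a different exposure (gain 2, bias 7).
patch = [10.0, 20.0, 30.0, 40.0]
brighter = [2.0 * v + 7.0 for v in patch]
normalized_a = normalize_bias_gain(patch)
normalized_b = normalize_bias_gain(brighter)
```

Because the normalization removes any positive gain and any additive bias, `normalized_a` and `normalized_b` agree up to floating-point rounding.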


Scale invariant feature transform (SIFT). SIFT features (Lowe 2004) are formed by computing
the gradient at each pixel in a 16 × 16 window around the detected keypoint, using the
appropriate level of the Gaussian pyramid at which the keypoint was detected. The gradient 
magnitudes are downweighted by a Gaussian fall-off function (shown as a blue circle in Fig- 
ure 7.16a) to reduce the influence of gradients far from the center, as these are more affected 
by small misregistrations. 

In each 4 × 4 quadrant, a gradient orientation histogram is formed by (conceptually)
adding the gradient values weighted by the Gaussian fall-off function to one of eight orientation
histogram bins. To reduce the effects of location and dominant orientation misestimation,
each of the original 256 weighted gradient magnitudes is softly added to 2 × 2 × 2 adjacent


against previous approaches. 
(a) image gradients (b) keypoint descriptor


Figure 7.16 A schematic representation of Lowe’s (2004) scale invariant feature transform
(SIFT): (a) Gradient orientations and magnitudes are computed at each pixel and weighted
by a Gaussian fall-off function (blue circle). (b) A weighted gradient orientation histogram
is then computed in each subregion, using trilinear interpolation. While this figure shows an
8 × 8 pixel patch and a 2 × 2 descriptor array, Lowe’s actual implementation uses 16 × 16
patches and a 4 × 4 array of eight-bin histograms.


histogram bins in the $(x, y, \theta)$ space using trilinear interpolation. Softly distributing values
to adjacent histogram bins is generally a good idea in any application where histograms are 
being computed, e.g., for Hough transforms (Section 7.4.2) or local histogram equalization 
(Section 3.1.4). 

The 4 × 4 array of eight-bin histograms yields 128 non-negative values, which form a raw version
of the SIFT descriptor vector. To reduce the effects of contrast or gain (additive variations are
already removed by the gradient), the 128-D vector is normalized to unit length. To further
make the descriptor robust to other photometric variations, values are clipped to 0.2 and the
resulting vector is once again renormalized to unit length.
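
The normalize, clip, and renormalize sequence can be sketched as follows, starting from a raw 128-D histogram vector. The function names and the toy input are hypothetical; the clipping threshold of 0.2 is the value given above.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit Euclidean length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0.0 else list(vec)

def sift_normalize(raw, clip=0.2):
    """Normalize to unit length, clip large entries, then renormalize."""
    unit = l2_normalize(raw)
    clipped = [min(v, clip) for v in unit]
    return l2_normalize(clipped)

# A raw histogram dominated by a single very strong gradient bin.
raw = [1.0] * 127 + [50.0]
descriptor = sift_normalize(raw)
same_under_gain = sift_normalize([3.0 * v for v in raw])
```

The first normalization makes the result independent of the overall gain, so `descriptor` and `same_under_gain` agree up to rounding, while the clipping step limits how much the single dominant bin contributes to the final vector.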


PCA-SIFT. Ke and Sukthankar (2004) propose a simpler way to compute descriptors inspired
by SIFT; it computes the x and y (gradient) derivatives over a 39 × 39 patch and
then reduces the resulting 3042-dimensional (2 × 39 × 39) vector to 36 using principal component analysis
(PCA) (Section 5.2.3 and Appendix A.1.2). Another popular variant of SIFT is SURF (Bay, 
Ess et al. 2008), which uses box filters to approximate the derivatives and integrals used in 
SIFT. 


RootSIFT. Arandjelović and Zisserman (2012) observe that by simply re-normalizing SIFT
descriptors using an $L_1$ measure and then taking the square root of each component, a dramatic
increase in performance (discriminability) can be obtained.
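
The RootSIFT transformation amounts to two lines of arithmetic, sketched below for a descriptor stored as a list of non-negative values. The function name and the short example vector are hypothetical; a real SIFT descriptor would have 128 entries.

```python
import math

def root_sift(descriptor):
    """L1-normalize a non-negative descriptor, then take element-wise square roots."""
    total = sum(descriptor)
    if total == 0.0:
        return list(descriptor)  # nothing to normalize
    return [math.sqrt(v / total) for v in descriptor]

# Short stand-in for a 128-D SIFT descriptor.
example = [4.0, 0.0, 9.0, 3.0]
rooted = root_sift(example)
```

Because the squared entries of the result sum to the $L_1$-normalized total, which is one, the output has unit Euclidean length by construction and can be compared with the usual Euclidean distance.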
(a) image gradients (b) keypoint descriptor


Figure 7.17 The gradient location-orientation histogram (GLOH) descriptor uses log-polar
bins instead of square bins to compute orientation histograms (Mikolajczyk and Schmid
2005). GLOH uses 16 gradient orientations inside each bin, although this figure only shows
8 to appear less cluttered.


Gradient location-orientation histogram (GLOH). This descriptor, developed by Mikolajczyk
and Schmid (2005), is a variant of SIFT that uses a log-polar binning structure instead
of the four quadrants used by Lowe (2004) (Figure 7.17). The spatial bins extend over the
radii 0...6, 6...11, and 11...15, with eight angular bins (except for the single central region),
for a total of 17 spatial bins. GLOH uses 16 orientation bins instead of the 8 used in SIFT.
The resulting 272-dimensional (17 × 16) histogram is then projected onto a 128-dimensional descriptor using
PCA trained on a large database. In their evaluation, Mikolajczyk and Schmid (2005) found
that GLOH, which has the best performance overall, outperforms SIFT by a small margin.
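
The log-polar bin layout can be made concrete with a small indexing sketch. The radii and bin counts come from the description above; the function name, the bin numbering, the treatment of the ring boundaries as half-open intervals, and the alignment of the first angular sector with the positive x axis are assumptions made for this example.

```python
import math

RADII = (6.0, 11.0, 15.0)   # outer radii of the central disc and the two rings
NUM_ANGULAR = 8             # angular sectors in each ring
NUM_ORIENT = 16             # gradient orientation bins per spatial bin

NUM_SPATIAL = 1 + 2 * NUM_ANGULAR         # 17 spatial bins
HISTOGRAM_LENGTH = NUM_SPATIAL * NUM_ORIENT  # 272 values before PCA

def gloh_spatial_bin(dx, dy):
    """Spatial bin index for an offset (dx, dy) from the keypoint, or None if outside."""
    radius = math.hypot(dx, dy)
    if radius >= RADII[2]:
        return None          # outside the support region
    if radius < RADII[0]:
        return 0             # single central bin, no angular subdivision
    ring = 0 if radius < RADII[1] else 1
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    sector = int(angle / (2.0 * math.pi / NUM_ANGULAR)) % NUM_ANGULAR
    return 1 + ring * NUM_ANGULAR + sector

# Collect the bins hit by all integer offsets inside the support region.
bins_used = {gloh_spatial_bin(dx, dy)
             for dx in range(-15, 16) for dy in range(-15, 16)} - {None}
```

Sweeping over all integer offsets touches each of the 17 spatial bins, and combining them with the 16 orientation bins gives the 272-dimensional histogram mentioned above.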


Steerable filters. Steerable filters (Section 3.2.3) are combinations of derivative of Gaus- 
sian filters that permit the rapid computation of even and odd (symmetric and anti-symmetric) 
edge-like and corner-like features at all possible orientations (Freeman and Adelson 1991). 
Because they use reasonably broad Gaussians, they too are somewhat insensitive to localiza- 


tion and orientation errors. 


Performance of local descriptors. Among the local descriptors that Mikolajczyk and Schmid
(2005) compared, they found that GLOH performed best, followed closely by SIFT. They also 
present results for many other descriptors not covered in this book. 

The field of feature descriptors continued to advance rapidly, with some techniques look- 
ing at local color information (van de Weijer and Schmid 2006; Abdel-Hakim and Farag 
2006). Winder and Brown (2007) develop a multi-stage framework for feature descriptor 
computation that subsumes both SIFT and GLOH (Figure 7.18a) and also allows them to 
Figure 7.18 Spatial summation blocks for SIFT, GLOH, and some related feature descriptors
(Winder and Brown 2007) © 2007 IEEE: (a) The parameters for the features, e.g., their
Gaussian weights, are learned from a training database of (b) matched real-world image
patches obtained from robust structure from motion applied to internet photo collections
(Hua, Brown, and Winder 2007).


learn optimal parameters for newer descriptors that outperform previous hand-tuned descrip- 
tors. Hua, Brown, and Winder (2007) extend this work by learning lower-dimensional projec- 
tions of higher-dimensional descriptors that have the best discriminative power, and Brown, 
Hua, and Winder (2011) further extend it by learning the optimal placement of the pooling re- 
gions. All of these papers use a database of real-world image patches (Figure 7.18b) obtained 
by sampling images at locations that were reliably matched using a robust structure-from- 
motion algorithm applied to internet photo collections (Snavely, Seitz, and Szeliski 2006; 
Goesele, Snavely et al. 2007). In concurrent work, Tola, Lepetit, and Fua (2010) developed 
a similar DAISY descriptor for dense stereo matching and optimized its parameters based on 
ground truth stereo data. 


While these techniques construct feature detectors that optimize for repeatability across 
all object classes, it is also possible to develop class- or instance-specific feature detectors that 
maximize discriminability from other classes (Ferencz, Learned-Miller, and Malik 2008). If 
planar surface orientations can be determined in the images being matched, it is also possible 
to extract viewpoint-invariant patches (Wu, Clipp et al. 2008). 


A more recent trend has been the development of binary bit-string feature descriptors, 
which can take advantage of fast Hamming distance operators in modern computer architec- 
tures. The BRIEF descriptor (Calonder, Lepetit et al. 2010) compares 128 different pairs of 
pixel values (denoted as line segments in Figure 7.19a) scattered around the keypoint location 
to obtain a 128-bit vector. ORB (Rublee, Rabaud et al. 2011) adds an orientation component 
to the FAST detector before computing oriented BRIEF descriptors. BRISK (Leutenegger, 
Chli, and Siegwart 2011) adds scale-space analysis to the FAST detector and a radially sym- 
metric sampling pattern (Figure 7.19b) to produce the binary descriptor. FREAK (Alahi, 
Figure 7.19 Binary bit-string feature descriptors: (a) the BRIEF descriptor compares 128
pairs of pixel values (denoted by line segments) and stores the comparison results in a 128-bit
vector (Calonder, Lepetit et al. 2010) © 2010 Springer; (b) BRISK sampling pattern and
Gaussian blur radii (Leutenegger, Chli, and Siegwart 2011) © 2011 IEEE; (c) FREAK retinal
sampling pattern (Alahi, Ortiz, and Vandergheynst 2012) © 2012 IEEE.


Ortiz, and Vandergheynst 2012) uses a more pronounced “retinal” (log-polar) sampling pat- 
tern paired with a cascade of bit comparisons for even greater speed and efficiency. The 
survey and evaluation by Mukherjee, Wu, and Wang (2015) compares all of these “classic” 
feature detectors and descriptors. 
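
To illustrate why bit-string descriptors are cheap to compute and compare, the sketch below builds a BRIEF-like descriptor from 128 pairwise intensity comparisons and matches descriptors with a Hamming distance. It is a simplified illustration: the uniformly random sampling pattern, the patch radius, and all function names are assumptions of this example, and the pre-smoothing of the image used in practical implementations is omitted.

```python
import random

def make_test_pairs(num_bits=128, patch_radius=8, seed=0):
    """Hypothetical sampling pattern: random pairs of (row, col) offsets."""
    rng = random.Random(seed)

    def offset():
        return (rng.randint(-patch_radius, patch_radius),
                rng.randint(-patch_radius, patch_radius))

    return [(offset(), offset()) for _ in range(num_bits)]

def brief_descriptor(image, row, col, pairs):
    """Pack one intensity comparison per pair into the bits of a Python int."""
    bits = 0
    for k, ((dr0, dc0), (dr1, dc1)) in enumerate(pairs):
        if image[row + dr0][col + dc0] < image[row + dr1][col + dc1]:
            bits |= 1 << k
    return bits

def hamming_distance(a, b):
    """Number of differing bits, computed with an XOR followed by a bit count."""
    return bin(a ^ b).count("1")

# Synthetic 32x32 image and a brighter, higher-contrast copy of it.
image = [[(r * 37 + c * 91) % 256 for c in range(32)] for r in range(32)]
brighter = [[2 * v + 5 for v in image_row] for image_row in image]
pairs = make_test_pairs()
d0 = brief_descriptor(image, 16, 16, pairs)
d1 = brief_descriptor(brighter, 16, 16, pairs)
```

Since each bit depends only on the ordering of two pixel values, the descriptor is unchanged by the monotonic intensity change applied here, so the Hamming distance between `d0` and `d1` is zero.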


Since 2015 or so, most of the new feature descriptors are constructed using deep learn- 
ing techniques, as surveyed in Balntas, Lenc et al. (2020) and Jin, Mishkin et al. (2021). 
Some of these descriptors, such as LIFT (Yi, Trulls et al. 2016), TFeat (Balntas, Riba et al. 
2016), HPatches (Balntas, Lenc et al. 2020), L2-Net (Tian, Fan, and Wu 2017), HardNet 
(Mishchuk, Mishkin et al. 2017), Geodesc (Luo, Shen et al. 2018), LF-Net (Ono, Trulls et 
al. 2018), SOSNet (Tian, Yu et al. 2019), and Key.Net (Barroso-Laguna, Riba et al. 2019),
operate on patches, much like the classical SIFT approach. They hence require an initial local 
feature detector to determine the center of the patch and use a predetermined patch size when 
constructing the input to the network. 


In contrast, approaches such as DELF (Noh, Araujo et al. 2017), SuperPoint (DeTone, 
Malisiewicz, and Rabinovich 2018), D2-Net (Dusmanu, Rocco et al. 2019), ContextDesc
(Luo, Shen et al. 2019), R2D2 (Revaud, Weinzaepfel et al. 2019), ASLFeat (Luo, Zhou et al. 
2020), and CAPS (Wang, Zhou et al. 2020) use the entire image as the input to the descriptor 
computation. This has the added benefit that the receptive field used to compute the descriptor 
can be learned from the data and does not require specifying a patch size. Theoretically, these
CNN models can learn receptive fields that use all of the pixels in the image, although in
Figure 7.20 HPatches local descriptors benchmark (Balntas, Lenc et al. 2020) © 2019
IEEE: (a) chronology of feature descriptors; (b) typical patches in the dataset (grouped by
Easy, Hard, and Tough); (c) size and speed of different descriptors.


practice they tend to use Gaussian-like receptive fields (Zhou, Khosla et al. 2015; Luo, Li et 
al. 2016; Selvaraju, Cogswell et al. 2017). 

In the HPatches benchmark (Figure 7.20) for evaluating patch matching by Balntas, Lenc
et al. (2020), HardNet and L2-Net performed the best on average. Another paper (Wang,
Zhou et al. 2020) shows CAPS and R2D2 as the best performers, and S2DNet (Germain,
Bourmaud, and Lepetit 2020) and LISRD (Pautrat, Larsson et al. 2020) also claim state-of-the-art
performance. The WISW benchmark (Bellavia and Colombo 2020), by contrast, shows that
traditional descriptors such as SIFT enhanced with more recent ideas do the best. On the wide
baseline image matching benchmark by Jin, Mishkin et al. (2021),⁴ HardNet, Key.Net, and
D2-Net were top performers (e.g., D2-Net had the highest number of landmarks), although
the results were quite task-dependent and the Difference of Gaussian detector was still the
best. The performance of these descriptors on matching features across large illumination
differences (day-night) has also been studied (Radenović, Schönberger et al. 2016; Zhou,
Sattler, and Jacobs 2016; Mishkin 2021).

The most recent trend in wide-baseline matching has been to densely extract features 
without a detector stage and to then match and refine the set of correspondences (Jiang, 
Trulls et al. 2021; Sarlin, Unagar et al. 2021; Sun, Shen et al. 2021; Truong, Danelljan et 
al. 2021; Zhou, Sattler, and Leal-Taixé 2021). Some of these more recent techniques have 
been evaluated by Mishkin (2021). 


⁴ The benchmark is associated with the CVPR Workshop on Image Matching: Local Features & Beyond: https://image-matching-workshop.github.io.

7.1.3 Feature matching


Once we have extracted features and their descriptors from two or more images, the next 
step is to establish some preliminary feature matches between these images. The approach 
we take depends partially on the application, e.g., different strategies may be preferable for 
matching images that are known to overlap (e.g., in image stitching) vs. images that may have 
no correspondence whatsoever (e.g., when trying to recognize objects from a database). 

In this section, we divide this problem into two separate components. The first is to select 
a matching strategy, which determines which correspondences are passed on to the next stage 
for further processing. The second is to devise efficient data structures and algorithms to 


perform this matching as quickly as possible, which we expand on in Section 7.1.4. 


Matching strategy and error rates 


Determining which feature matches are reasonable to process further depends on the context 
in which the matching is being performed. Say we are given two images that overlap by a
fair amount (e.g., for image stitching or for tracking objects in a video). We know that most
features in one image are likely to match the other image, although some may not match 
because they are occluded or their appearance has changed too much. 

On the other hand, if we are trying to recognize how many known objects appear in a 
cluttered scene (Figure 6.2), most of the features may not match. Furthermore, a large number 
of potentially matching objects must be searched, which requires more efficient strategies, as 
described below. 

To begin with, we assume that the feature descriptors have been designed so that Eu- 
clidean (vector magnitude) distances in feature space can be directly used for ranking poten- 
tial matches. If it turns out that certain parameters (axes) in a descriptor are more reliable 
than others, it is usually preferable to re-scale these axes ahead of time, e.g., by determin- 
ing how much they vary when compared against other known good matches (Hua, Brown, 
and Winder 2007). A more general process, which involves transforming feature vectors 
into a new scaled basis, is called whitening and is discussed in more detail in the context of 
eigenface-based face recognition (Section 5.2.3). 

Given a Euclidean distance metric, the simplest matching strategy is to set a threshold 
(maximum distance) and to return all matches from other images within this threshold. Set- 
ting the threshold too high results in too many false positives, i.e., incorrect matches being 
returned. Setting the threshold too low results in too many false negatives, i.e., too many
correct matches being missed (Figure 7.21). 
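
As a minimal sketch of this fixed-threshold strategy, we can return every database descriptor whose Euclidean distance to the query falls within the threshold. The function names and the toy two-dimensional descriptors below are illustrative assumptions, not part of any particular library.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def threshold_matches(query, database, max_dist):
    """Return (index, distance) for every database descriptor within max_dist."""
    hits = []
    for i, d in enumerate(database):
        dist = euclidean(query, d)
        if dist <= max_dist:
            hits.append((i, dist))
    return sorted(hits, key=lambda h: h[1])   # closest candidates first

# Toy descriptors (hypothetical data).
database = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
query = (0.1, 0.0)
tight = threshold_matches(query, database, 0.5)   # a low threshold
loose = threshold_matches(query, database, 6.0)   # a high threshold
```

With the low threshold only the closest descriptor is returned, while the high threshold admits all three, which is how a looser threshold trades false negatives for false positives.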


Figure 7.21 False positives and negatives: The black digits 1 and 2 are features being
matched against a database of features in other images. At the current threshold setting (the
solid circles), the green 1 is a true positive (good match), the blue 1 is a false negative (failure
to match), and the red 3 is a false positive (incorrect match). If we set the threshold higher
(the dashed circles), the blue 1 becomes a true positive but the brown 4 becomes an additional
false positive.

                         True matches   True non-matches
Predicted matches        TP = 18        FP = 4             P' = 22   PPV = 0.82
Predicted non-matches    FN = 2         TN = 76            N' = 78
                         P = 20         N = 80             Total = 100

                         TPR = 0.90     FPR = 0.05         ACC = 0.94

Table 7.1 The number of matches correctly and incorrectly estimated by a feature matching
algorithm, showing the number of true positives (TP), false positives (FP), false negatives
(FN), and true negatives (TN). The columns sum up to the actual number of positives (P) and
negatives (N), while the rows sum up to the predicted number of positives (P') and negatives
(N'). The formulas for the true positive rate (TPR), the false positive rate (FPR), the positive
predictive value (PPV), and the accuracy (ACC) are given in the text.

We can quantify the performance of a matching algorithm at a particular threshold by
first counting the number of true and false matches and match failures, using the following
definitions (Fawcett 2006), which we already discussed in Section 6.3.3:

• TP: true positives, i.e., number of correct matches;

• FN: false negatives, matches that were not correctly detected;

• FP: false positives, proposed matches that are incorrect;

• TN: true negatives, non-matches that were correctly rejected.


Table 7.1 shows a sample confusion matrix (contingency table) containing such numbers. 
We can convert these numbers into unit rates by defining the following quantities (Fawcett 
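2006):

• true positive rate (TPR),

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{\mathrm{TP}}{\mathrm{P}}; \tag{7.14}$$

• false positive rate (FPR),

$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = \frac{\mathrm{FP}}{\mathrm{N}}; \tag{7.15}$$

• positive predictive value (PPV),

$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{\mathrm{TP}}{\mathrm{P'}}; \tag{7.16}$$

• accuracy (ACC),

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{P} + \mathrm{N}}. \tag{7.17}$$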
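
The rates in (7.14) to (7.17) are simple ratios of the four counts, as the following sketch shows. The function name `rates` is a hypothetical helper. The counts are those of Table 7.1, where FP = 4 and FN = 2 follow from the table's total of 100 and its stated TPR and FPR.

```python
def rates(tp, fp, fn, tn):
    """Compute TPR, FPR, PPV, and ACC from confusion-matrix counts."""
    p, n = tp + fn, fp + tn      # actual positives (P) and negatives (N)
    p_pred = tp + fp             # predicted positives (P')
    return {
        "TPR": tp / p,
        "FPR": fp / n,
        "PPV": tp / p_pred,
        "ACC": (tp + tn) / (p + n),
    }

r = rates(tp=18, fp=4, fn=2, tn=76)
```

Rounded to two decimals, these values reproduce the TPR, FPR, PPV, and ACC entries of Table 7.1.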


In the information retrieval (or document retrieval) literature (Baeza-Yates and Ribeiro-Neto
1999; Manning, Raghavan, and Schütze 2008), the term precision (how many returned
documents are relevant) is used instead of PPV and recall (what fraction of relevant docu- 
ments was found) is used instead of TPR (see also Section 6.3.3). The precision and recall 
can be combined into a single measure called the F-score, which is their harmonic mean. 
This single measure is often used to rank vision algorithms (Knapitsch, Park et al. 2017). 
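
For reference, the harmonic mean of precision (PPV) and recall (TPR) can be written directly in terms of the confusion matrix counts. The worked number below uses the counts of Table 7.1 (TP = 18, FP = 4, FN = 2), and the symbol $F$ is our own shorthand for the F-score.

```latex
F = \frac{2\,\mathrm{PPV}\cdot\mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}
  = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
  = \frac{36}{36 + 4 + 2} \approx 0.857 .
```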

Any particular matching strategy (at a particular threshold or parameter setting) can be 
rated by the TPR and FPR numbers; ideally, the true positive rate will be close to 1 and the 
false positive rate close to 0. As we vary the matching threshold, we obtain a family of such 
points, which are collectively known as the receiver operating characteristic (ROC) curve 
(Fawcett 2006) (Figure 7.22a). The closer this curve lies to the upper left corner, i.e., the 
larger the area under the curve (AUC), the better its performance. Figure 7.22b shows how 
we can plot the number of matches and non-matches as a function of inter-feature distance d. 
These curves can then be used to plot an ROC curve (Exercise 7.3). The ROC curve can also 
be used to calculate the mean average precision, which is the average precision (PPV) as you
vary the threshold to select the best result, then the top two results, etc. (see Section 6.3.3
and Figure 6.27). 
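
To make the construction of the ROC curve concrete, the sketch below sweeps a distance threshold over two lists of inter-feature distances, one for true matches and one for true non-matches, and integrates the resulting curve with the trapezoidal rule. The function names and the toy distances are assumptions made for illustration.

```python
def roc_points(match_dists, nonmatch_dists):
    """Sweep a distance threshold and return a list of (FPR, TPR) points."""
    thresholds = sorted(set(match_dists) | set(nonmatch_dists))
    points = [(0.0, 0.0)]                      # threshold below every distance
    for t in thresholds:
        tp = sum(d <= t for d in match_dists)      # matches accepted
        fp = sum(d <= t for d in nonmatch_dists)   # non-matches accepted
        points.append((fp / len(nonmatch_dists), tp / len(match_dists)))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

match_dists = [0.1, 0.2, 0.3, 0.6]       # distances of true matches (toy data)
nonmatch_dists = [0.4, 0.7, 0.8, 0.9]    # distances of true non-matches (toy data)
curve = roc_points(match_dists, nonmatch_dists)
area = auc(curve)
```

In this toy example, only one of the 16 (match, non-match) pairs is ordered incorrectly (the match at 0.6 is farther than the non-match at 0.4), so the area is 15/16.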

The problem with using a fixed threshold is that it is difficult to set; the useful range 
of thresholds can vary a lot as we move to different parts of the feature space (Lowe 2004; 
Mikolajczyk and Schmid 2005). A better strategy in such cases is to simply match the nearest 
neighbor in feature space. Since some features may have no matches (e.g., they may be part 
of background clutter in object recognition or they may be occluded in the other image), a 
threshold is still used to reduce the number of false positives. 

Figure 7.22 ROC curve and its related rates: (a) The ROC curve plots the true positive
rate against the false positive rate for a particular combination of feature extraction and
matching algorithms. Ideally, the true positive rate should be close to 1, while the false
positive rate is close to 0. The area under the ROC curve (AUC) is often used as a single
(scalar) measure of algorithm performance. Alternatively, the equal error rate is sometimes
used. (b) The distribution of positives (matches) and negatives (non-matches) as a function
of inter-feature distance d. As the threshold θ is increased, the number of true positives (TP)
and false positives (FP) increases.

Ideally, this threshold itself will adapt to different regions of the feature space. If sufficient
training data is available (Hua, Brown, and Winder 2007), it is sometimes possible to learn
different thresholds for different features. Often, however, we are simply given a collection
of images to match, e.g., when stitching images or constructing 3D models from unordered 
photo collections (Brown and Lowe 2007, 2005; Snavely, Seitz, and Szeliski 2006). In this 
case, a useful heuristic can be to compare the nearest neighbor distance to that of the second 
nearest neighbor, preferably taken from an image that is known not to match the target (e.g., 
a different object in the database) (Brown and Lowe 2002; Lowe 2004; Mishkin, Matas, and 
Perdoch 2015). We can define this nearest neighbor distance ratio (Mikolajczyk and Schmid 
2005) as 

$$\mathrm{NNDR} = \frac{d_1}{d_2} = \frac{\|D_A - D_B\|}{\|D_A - D_C\|}, \tag{7.18}$$

where $d_1$ and $d_2$ are the nearest and second nearest neighbor distances, $D_A$ is the target
descriptor, and $D_B$ and $D_C$ are its closest two neighbors (Figure 7.23). Recent work has
shown that mutual NNDR (or, at least, NNDR with a cross-consistency check) works noticeably
better than one-way NNDR (Bellavia and Colombo 2020; Jin, Mishkin et al. 2021).

Figure 7.23 Fixed threshold, nearest neighbor, and nearest neighbor distance ratio matching.
At a fixed distance threshold (dashed circles), descriptor $D_A$ fails to match $D_B$ and $D_D$
incorrectly matches $D_C$ and $D_E$. If we pick the nearest neighbor, $D_A$ correctly matches $D_B$
but $D_D$ incorrectly matches $D_C$. Using nearest neighbor distance ratio (NNDR) matching,
the small NNDR $d_1/d_2$ correctly matches $D_A$ with $D_B$, and the large NNDR $d_1'/d_2'$ correctly
rejects matches for $D_D$.
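
The NNDR test and its mutual (cross-checked) variant can be sketched as follows. This is a brute-force illustration under stated assumptions: the function names are hypothetical, the ratio limit of 0.8 is only an illustrative default, and the second nearest neighbor is taken from the same image rather than from an image known not to match.

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nndr_matches(desc_a, desc_b, max_ratio=0.8):
    """One-way NNDR: map index in desc_a -> index in desc_b when d1/d2 is small."""
    matches = {}
    for i, a in enumerate(desc_a):
        ranked = sorted((dist(a, b), j) for j, b in enumerate(desc_b))
        if len(ranked) < 2:
            continue                      # a ratio needs two neighbors
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        if d2 > 0 and d1 / d2 < max_ratio:
            matches[i] = j1
    return matches

def mutual_nndr_matches(desc_a, desc_b, max_ratio=0.8):
    """Keep only matches that pass the ratio test in both directions."""
    ab = nndr_matches(desc_a, desc_b, max_ratio)
    ba = nndr_matches(desc_b, desc_a, max_ratio)
    return sorted((i, j) for i, j in ab.items() if ba.get(j) == i)

# Toy 2D descriptors (hypothetical data).
desc_a = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
desc_b = [(0.1, 0.0), (5.0, 5.2), (2.0, 9.0), (2.2, 9.0)]
one_way = nndr_matches(desc_a, desc_b)
mutual = mutual_nndr_matches(desc_a, desc_b)
```

Here the third descriptor in `desc_a` passes the one-way ratio test against a descriptor that is already the mutual best match of another feature, so the cross-consistency check discards it.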


Efficient matching 


Once we have decided on a matching strategy, we still need to efficiently search for potential 
candidates. The simplest way to find all corresponding feature points is to compare all fea- 
tures against all other features in each pair of potentially matching images. While traditionally 
this has been too computationally expensive, modern GPUs have enabled such comparisons. 

A more efficient approach is to devise an indexing structure, such as a multi-dimensional 
search tree or a hash table, to rapidly search for features near a given feature. Such indexing 
structures can either be built for each image independently (which is useful if we want to only 
consider certain potential matches, e.g., searching for a particular object) or globally for all 
the images in a given database, which can potentially be faster, since it removes the need to it- 
erate over each image. For extremely large databases (millions of images or more), even more 
efficient structures based on ideas from document retrieval, e.g., vocabulary trees (Nistér and 
Stewénius 2006), product quantization (Jégou, Douze, and Schmid 2010; Johnson, Douze, 
and Jégou 2021), or an inverted multi-index (Babenko and Lempitsky 2015b) can be used, as 
discussed in Section 7.1.4. 

One of the simpler techniques to implement is multi-dimensional hashing, which maps 
descriptors into fixed size buckets based on some function applied to each descriptor vector. 
At matching time, each new feature is hashed into a bucket, and a search of nearby buckets 
is used to return potential candidates, which can then be sorted or graded to determine which
are valid matches. 


A simple example of hashing is the Haar wavelets used by Brown, Szeliski, and Winder
(2005) in their MOPS paper. During the matching structure construction, each 8 × 8 scaled,
oriented, and normalized MOPS patch is converted into a three-element index by performing
sums over different quadrants of the patch. The resulting three values are normalized by their
expected standard deviations and then mapped to the two (of b = 10) nearest 1D bins. The
three-dimensional indices formed by concatenating the three quantized values are used to
index the $2^3 = 8$ bins where the feature is stored (added). At query time, only the primary
(closest) indices are used, so only a single three-dimensional bin needs to be examined. The 
coefficients in the bin can then be used to select k approximate nearest neighbors for further 


processing (such as computing the NNDR). 
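
The indexing scheme just described can be sketched as below. This is only an illustration under stated assumptions: the three normalized values are taken as given (the Haar wavelet sums are not computed), the value range of [-3, 3] and the class and function names are hypothetical, and the code is not the MOPS implementation.

```python
from collections import defaultdict
from itertools import product

B = 10                  # number of 1D bins per dimension (b = 10 in the text)
LO, HI = -3.0, 3.0      # assumed range of the normalized values (hypothetical)

def two_nearest_bins(v):
    """Return (primary, secondary) 1D bin indices for a normalized value."""
    u = (min(max(v, LO), HI) - LO) / (HI - LO) * B   # position in bin units
    primary = min(int(u), B - 1)
    # The secondary bin is the neighbor on the side the value is closer to.
    secondary = primary + 1 if u - primary >= 0.5 else primary - 1
    secondary = min(max(secondary, 0), B - 1)
    return primary, secondary

class HashIndex:
    def __init__(self):
        self.table = defaultdict(list)

    def add(self, feature_id, values):
        """Store the feature in up to 2**3 = 8 three-dimensional bins."""
        choices = [set(two_nearest_bins(v)) for v in values]
        for key in product(*choices):
            self.table[key].append(feature_id)

    def query(self, values):
        """Examine only the single primary bin of the query."""
        key = tuple(two_nearest_bins(v)[0] for v in values)
        return self.table.get(key, [])

index = HashIndex()
index.add("f0", (0.05, -1.0, 2.0))
index.add("f1", (-2.5, 2.5, 0.0))
candidates = index.query((-0.05, -1.1, 2.1))
```

The query's first value falls into a different primary bin than the stored feature's first value, but the feature is still found because it was also stored in its secondary bin. The returned candidates would then be ranked by descriptor distance.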


A more complex, but more widely applicable, version of hashing is called locality sen- 
sitive hashing, which uses unions of independently computed hashing functions to index 
the features (Gionis, Indyk, and Motwani 1999; Shakhnarovich, Darrell, and Indyk 2006). 
Shakhnarovich, Viola, and Darrell (2003) extend this technique to be more sensitive to the 
distribution of points in parameter space, which they call parameter-sensitive hashing. More 
recent work converts high-dimensional descriptor vectors into binary codes that can be com- 
pared using Hamming distances (Torralba, Weiss, and Fergus 2008; Weiss, Torralba, and 
Fergus 2008) or that can accommodate arbitrary kernel functions (Kulis and Grauman 2009; 
Raginsky and Lazebnik 2009). 
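
To illustrate how descriptors can be turned into binary codes that are compared with Hamming distances, the sketch below uses random hyperplane (sign) hashing, one common locality sensitive scheme for angular similarity. It is not necessarily the scheme used in the papers cited above, and the function names, code length, and seed are assumptions.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def binary_code(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    code = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(num_bits=16, dim=4)
v = [1.0, 2.0, -0.5, 0.3]
code_v = binary_code(v, planes)
code_scaled = binary_code([2.0 * x for x in v], planes)    # same direction
code_opposite = binary_code([-x for x in v], planes)       # opposite direction
```

Vectors pointing in the same direction receive identical codes, while opposite vectors differ in every bit. For intermediate cases, the expected Hamming distance grows with the angle between the vectors.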


Another widely used class of indexing structures is multi-dimensional search trees. The
best known of these are k-d trees, also often written as kd-trees, which divide the multi- 
dimensional feature space along alternating axis-aligned hyperplanes, choosing the threshold 
along each axis so as to maximize some criterion, such as the search tree balance (Samet 
1989). Figure 7.24 shows an example of a two-dimensional k-d tree. Here, eight different data 
points A–H are shown as small diamonds arranged on a two-dimensional plane. The k-d tree
recursively splits this plane along axis-aligned (horizontal or vertical) cutting planes. Each 
split can be denoted using the dimension number and split value (Figure 7.24b). The splits are 
arranged so as to try to balance the tree, i.e., to keep its maximum depth as small as possible. 
At query time, a classic k-d tree search first locates the query point (+) in its appropriate 
bin (D), and then searches nearby leaves in the tree (C, B, ...) until it can guarantee that 
the nearest neighbor has been found. The best bin first (BBF) search (Beis and Lowe 1999) 
searches bins in order of their spatial proximity to the query point and is therefore usually 


more efficient. 
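
A compact k-d tree with exact nearest neighbor search can be sketched as follows. For brevity, this variant stores one data point at every node and splits at the median along alternating axes, which differs slightly from the leaf-based layout of Figure 7.24. It also performs a plain depth-first search with pruning rather than best bin first search. The function names and the eight toy points are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split on alternating axes at the median point."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Exact nearest neighbor search; returns (distance, point)."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if abs(diff) < best[0]:      # the far side could still hold a closer point
        best = nearest(far, query, best)
    return best

points = [(0.2, 0.7), (0.5, 0.1), (0.9, 0.6), (0.35, 0.4),
          (0.7, 0.9), (0.1, 0.2), (0.6, 0.5), (0.8, 0.3)]
tree = build_kdtree(points)
d, p = nearest(tree, (0.55, 0.45))
```

The pruning test `abs(diff) < best[0]` is what lets the search guarantee that the nearest neighbor has been found without visiting every node.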


Figure 7.24 K-d tree and best bin first (BBF) search (Beis and Lowe 1999) © 1999 IEEE:
(a) The spatial arrangement of the axis-aligned cutting planes is shown using dashed lines.
Individual data points are shown as small diamonds. (b) The same subdivision can be represented
as a tree, where each interior node represents an axis-aligned cutting plane (e.g., the
top node cuts along dimension d1 at value .34) and each leaf node is a data point. During a
BBF search, a query point (denoted by “+”) first looks in its containing bin (D) and then in
its nearest adjacent bin (B), rather than its closest neighbor in the tree (C).

Many additional data structures have been developed for solving exact and approximate
nearest neighbor problems (Arya, Mount et al. 1998; Liang, Liu et al. 2001; Hjaltason and
Samet 2003). For example, Nene and Nayar (1997) developed a technique they call slicing
that uses a series of 1D binary searches on the point list sorted along different dimensions
to efficiently cull down a list of candidate points that lie within a hypercube of the query
point. Grauman and Darrell (2005) reweight the matches at different levels of an indexing 
tree, which allows their technique to be less sensitive to discretization errors in the tree con- 
struction. Nistér and Stewénius (2006) use a metric tree, which compares feature descriptors 
to a small number of prototypes at each level in a hierarchy. The resulting quantized visual 
words can then be used with classical information retrieval (document relevance) techniques 
to quickly winnow down a set of potential candidates from a database of millions of images 
(Section 7.1.4). Muja and Lowe (2009) compare a number of these approaches, introduce a 
new one of their own (priority search on hierarchical k-means trees), and conclude that multi- 
ple randomized k-d trees often provide the best performance. Modern libraries for computing 
approximate nearest neighbors include FLANN (Muja and Lowe 2014) and Faiss (Johnson, 
Douze, and Jégou 2021), which are discussed in Section 5.1.1 and Appendix C.2. 


Feature match verification and densification 


Once we have some candidate matches, we can use geometric alignment (Section 8.1) to
verify which matches are inliers and which ones are outliers. For example, if we expect the
whole image to be translated or rotated in the matching view, we can fit a global geometric
transform and keep only those feature matches that are sufficiently close to this estimated
transformation. The process of selecting a small set of seed matches and then verifying a
larger set is often called random sampling or RANSAC (Section 8.1.4). Once an initial set
of correspondences has been established, some systems look for additional matches, e.g., by
looking for additional correspondences along epipolar lines (Section 12.1) or in the vicinity
of estimated locations based on the global transform. It is also possible to use deep neural
networks to perform feature matching and filtering, as in the SuperGlue system of Sarlin,
DeTone et al. (2020). These topics are discussed further in Sections 8.1 and 12.2.

Figure 7.25 Visual words obtained from elliptical normalized affine regions (Sivic and Zisserman
2009) © 2009 IEEE. (a) Affine covariant regions are extracted from each frame and
clustered into visual words using k-means clustering on SIFT descriptors with a learned Mahalanobis
distance. (b) The central patch in each grid shows the query and the surrounding
patches show the nearest neighbors.
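
The seed-and-verify idea behind geometric match verification can be sketched for the simplest motion model, a pure 2D translation, where a single candidate match is enough to hypothesize the transform. The function name, the inlier tolerance, the iteration count, and the toy correspondences are assumptions chosen for illustration.

```python
import random

def ransac_translation(matches, tol=2.0, iters=100, seed=0):
    """matches: list of ((x, y), (x2, y2)) candidate correspondences.
    Returns the estimated translation and the indices of the inliers."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        (x, y), (x2, y2) = rng.choice(matches)   # one seed match fixes a translation
        tx, ty = x2 - x, y2 - y
        inliers = [
            i for i, ((ax, ay), (bx, by)) in enumerate(matches)
            if (ax + tx - bx) ** 2 + (ay + ty - by) ** 2 <= tol ** 2
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on all inliers by averaging their displacements.
    n = len(best_inliers)
    tx = sum(matches[i][1][0] - matches[i][0][0] for i in best_inliers) / n
    ty = sum(matches[i][1][1] - matches[i][0][1] for i in best_inliers) / n
    return (tx, ty), best_inliers

# Five matches consistent with a shift of about (10, -5) and two outliers (toy data).
matches = [
    ((0, 0), (10, -5)),
    ((5, 1), (15.2, -4.1)),
    ((2, 8), (11.9, 3.0)),
    ((7, 3), (17, -2)),
    ((1, 1), (40, 30)),      # outlier
    ((9, 9), (0, 0)),        # outlier
    ((4, 6), (14.1, 0.9)),
]
(tx, ty), inliers = ransac_translation(matches)
```

Richer motion models such as similarity or affine transforms need more seed matches per hypothesis, but the loop has the same structure.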


7.1.4 Large-scale matching and retrieval 


As the number of objects in the database starts to grow (say, billions of objects or video 
frames), the time it takes to match a new image against each database image can become 
prohibitive. Instead of comparing the images one at a time, techniques are needed to quickly 
narrow down the search to a few likely images, which can then be compared using a more 
conservative verification stage. 

The problem of quickly finding partial matches between documents is one of the central
problems in information retrieval (IR) (Baeza-Yates and Ribeiro-Neto 1999; Manning,
Raghavan, and Schütze 2008). In computer vision, the problem of finding a particular object
in a large collection is called content-based image retrieval (CBIR) (Smeulders, Worring et al.
2000; Lew, Sebe et al. 2006; Vasconcelos 2007; Datta, Joshi et al. 2008) or instance retrieval
(Zheng, Yang, and Tian 2018). The basic approach in fast document retrieval algorithms is 
to precompute an inverted index between individual words and the documents (or web pages 


or news stories) where they occur. More precisely, the frequency of occurrence of particular
words in a document is used to quickly find documents that match a particular query.


Sivic and Zisserman (2009) were the first to adapt IR techniques to visual search. In their 
Video Google system, affine invariant features are first detected in all the video frames they 
are indexing using both shape adapted regions around Harris feature points (Schaffalitzky 
and Zisserman 2002; Mikolajczyk and Schmid 2004) and maximally stable extremal regions 
(Matas, Chum et al. 2004; Section 7.1.1), as shown in Figure 7.25a. Next, 128-dimensional 
SIFT descriptors are computed from each normalized region (i.e., the patches shown in Fig- 
ure 7.25b). Then, an average covariance matrix for these descriptors is estimated by accumu- 
lating statistics for features tracked from frame to frame. The feature descriptor covariance 
$\Sigma$ is then used to define a Mahalanobis distance (5.32) between feature descriptors. In practice,
feature descriptors are whitened by pre-multiplying them by $\Sigma^{-1/2}$ so that Euclidean
distances can be used.⁵
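
The equivalence between whitening and the Mahalanobis distance follows in one line. Writing $\Sigma$ for the descriptor covariance and $x' = \Sigma^{-1/2} x$, $y' = \Sigma^{-1/2} y$ for the whitened descriptors (notation introduced here for illustration), and using the symmetry of $\Sigma^{-1/2}$:

```latex
\|x' - y'\|^2
  = (x - y)^T \Sigma^{-1/2} \Sigma^{-1/2} (x - y)
  = (x - y)^T \Sigma^{-1} (x - y) .
```

The right-hand side is the squared Mahalanobis distance between the original descriptors, so ordinary Euclidean nearest neighbor structures can be applied to the whitened vectors.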


To apply fast information retrieval techniques to images, the high-dimensional feature 
descriptors that occur in each image must first be mapped into discrete visual words. Sivic 
and Zisserman (2003) perform this mapping using k-means clustering, while some of the later 
methods (Nistér and Stewénius 2006; Philbin, Chum et al. 2007) use alternative techniques, 
such as vocabulary trees or randomized forests. To keep the clustering time manageable, 
only a few hundred video frames are used to learn the cluster centers, which still involves 
estimating several thousand clusters from about 300,000 descriptors, although subsequent 
work has greatly extended this capacity (Nistér and Stewénius 2006; Philbin, Chum et al.
2007; Mikulik, Perdoch et al. 2013). At visual query time, each feature in a new query region 
(e.g., Figure 7.25a, which is a cropped region from a larger video frame) is mapped to its 
corresponding visual word. To keep very common patterns from contaminating the results, a 
stop list of the most common visual words is created and such words are dropped from further 
consideration. 
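
The mapping from descriptors to visual words and the construction of a stop list can be sketched as follows. The cluster centers are assumed to have been computed beforehand (e.g., by k-means), and the function names and the toy two-dimensional descriptors are hypothetical.

```python
import math
from collections import Counter

def assign_words(descriptors, centers):
    """Map each descriptor to the index of its nearest cluster center."""
    return [min(range(len(centers)), key=lambda k: math.dist(d, centers[k]))
            for d in descriptors]

def build_stop_list(word_lists, num_stop):
    """Return the num_stop most frequent visual words over all images."""
    counts = Counter(w for words in word_lists for w in words)
    return {w for w, _ in counts.most_common(num_stop)}

def remove_stop_words(words, stop_list):
    """Drop visual words that appear in the stop list."""
    return [w for w in words if w not in stop_list]

centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]   # assumed cluster centers
images = [
    [(0.5, 0.2), (9.0, 1.0), (0.1, 0.3)],
    [(0.2, 0.1), (1.0, 9.0)],
    [(0.3, 0.3), (0.0, 0.4)],
]
word_lists = [assign_words(descs, centers) for descs in images]
stop_list = build_stop_list(word_lists, num_stop=1)
filtered = [remove_stop_words(words, stop_list) for words in word_lists]
```

In this toy example, visual word 0 occurs in every image and is therefore placed on the stop list, leaving only the more distinctive words for retrieval.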

Once a query image or region has been mapped into its constituent visual words, likely 
matching images must then be retrieved from the database. The exact details of how this 
is done can be found in Sivic and Zisserman (2009), Nistér and Stewénius (2006), Philbin, 
Chum et al. (2007), Chum, Philbin et al. (2007), Philbin, Chum et al. (2008), and also in 
the first edition of this book (Szeliski 2010, Section 14.3.2). Because of the high efficiency 
in both quantizing and scoring features, the vocabulary-tree-based recognition system built 
by Nistér and Stewénius (2006) was able to process incoming images in real time against a 


database of 40,000 CD covers and at 1 Hz when matching a database of one million frames
taken from six feature-length movies.

⁵ Note that the computation of feature covariances from matched feature points is much more sensible than simply
performing a PCA on the descriptor space (Winder and Brown 2007). This corresponds roughly to the within-class
scatter matrix we studied in Section 5.2.3.

Figure 7.26 Location or building recognition using randomized trees (Philbin, Chum et
al. 2007) © 2007 IEEE. The left image is the query; the other images are the highest-ranked
results.
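
The retrieval step can be illustrated with a small inverted index that maps each visual word to the images containing it. This sketch scores images by raw occurrence counts, which is a simplification of the weighting schemes used in the cited systems. The function names and image identifiers are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(images):
    """images: {image_id: list of visual word ids}.
    Returns word -> {image_id: occurrence count}."""
    index = defaultdict(dict)
    for image_id, words in images.items():
        for word, count in Counter(words).items():
            index[word][image_id] = count
    return index

def search(index, query_words):
    """Score images by the summed occurrence counts of the query words."""
    scores = Counter()
    for word in set(query_words):
        for image_id, count in index.get(word, {}).items():
            scores[image_id] += count
    return scores.most_common()      # highest-scoring images first

images = {"img_a": [3, 7, 3], "img_b": [7, 9], "img_c": [5]}
index = build_inverted_index(images)
ranked = search(index, [3, 7])
```

Only images that share at least one visual word with the query are touched, which is what makes this kind of lookup fast on large databases. The top-ranked images would then be passed to a geometric verification stage.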

Instance recognition systems continued to improve rapidly in the 2000s. Philbin, Chum et
al. (2007) showed that a randomized forest of k-d trees performs better than vocabulary trees on
a large location recognition task (Figure 7.26). They also compared the effects of using dif- 
ferent 2D motion models (Section 2.1.1) in the verification stage. In follow-on work, Chum, 
Philbin et al. (2007) applied another idea from information retrieval, namely query expansion, 
which involves re-submitting top-ranked images from the initial query as additional queries 
to generate additional candidate results.⁶ Philbin, Chum et al. (2008) showed how to mitigate
quantization problems in visual words selection using soft assignment, where each feature 
descriptor is mapped to a number of nearby visual words, which is similar to the multiple 
assignment idea proposed earlier by Jégou, Harzallah, and Schmid (2007). However, such 
techniques tend to reduce the sparsity of visual word vectors and increase the memory and 
computation costs. Jégou, Douze, and Schmid (2008) incorporated partial geometrical infor- 
mation and an explicit matching scheme between local descriptors in the initial large-scale 
image ranking stage. Taken together, these techniques enabled instance recognition algorithms
to perform Web-scale retrieval, matching, and 3D reconstruction tasks (Agarwal, Furukawa et al.
2010, 2011; Frahm, Fite-Georgel et al. 2010; Snavely, Simon et al. 2010). 

Since the “deep learning revolution” in 2012, researchers have started developing neu- 
ral feature detectors and descriptors (Sections 7.1.1 and 7.1.2) and sometimes combining 
them into end-to-end matching systems.⁷ Figure 7.27 shows some of the major milestones
in instance retrieval, while Figure 7.28 shows the variety of different classic and CNN-based 
retrieval architectures that have been considered. The survey paper by Zheng, Yang, and Tian 


(2018) describes and contrasts these various algorithms in more detail and also provides an
experimental comparison of some of these algorithms on image retrieval datasets. You can
also find more details on related techniques and systems in Section 6.2.3 on visual similarity
search, which discusses global descriptors that represent an image with a single vector
(Arandjelović, Gronat et al. 2016; Radenović, Tolias, and Chum 2019; Yang, Kien Nguyen
et al. 2019; Cao, Araujo, and Sim 2020; Ng, Balntas et al. 2020; Tolias, Jenicek, and Chum
2020) as alternatives to bags of local features, Section 11.2.3 on location recognition, and
Section 11.4.6 on large-scale 3D reconstruction from community (internet) photos.

⁶ An alternative to query expansion is database-side augmentation (Arandjelović and Zisserman 2012).
⁷ But note that some popular open-source large-scale reconstruction systems such as COLMAP still use traditional
features and indexing schemes (Schönberger and Frahm 2016).

Figure 7.27 Milestones in instance retrieval (Zheng, Yang, and Tian 2018) © 2018 IEEE,
showing the shift from hand-crafted feature-based retrieval to CNN-based approaches.

Figure 7.28 Typical pipeline for feature-based instance retrieval (Zheng, Yang, and Tian
2018) © 2018 IEEE, showing the feature extraction, encoding, and indexing portions, which
are often collapsed when using a deep learning framework.


7.1.5 Feature tracking 


An alternative to independently finding features in all candidate images and then matching 
them is to find a set of likely feature locations in a first image and to then search for their 
corresponding locations in subsequent images. This kind of detect then track approach is 
more widely used for video tracking applications, where the amount of motion and
appearance deformation between adjacent frames is expected to be small.

The process of selecting good features to track is closely related to selecting good features 
for more general recognition applications. In practice, regions containing high gradients in 
both directions, i.e., which have high eigenvalues in the auto-correlation matrix (7.8), provide 
stable locations at which to find correspondences (Shi and Tomasi 1994). 
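
The selection criterion can be sketched by computing the smaller eigenvalue of the 2 × 2 gradient auto-correlation matrix over a patch. This is a simplified illustration: it uses central differences and uniform weights instead of a weighting window, and the function name and toy patches are assumptions.

```python
import math

def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient auto-correlation matrix of a patch.
    patch is a list of rows of gray values; gradients use central differences."""
    h, w = len(patch), len(patch[0])
    a = b = c = 0.0                      # matrix entries [[a, b], [b, c]]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            iy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            a += ix * ix
            b += ix * iy
            c += iy * iy
    # Closed-form smaller eigenvalue of a symmetric 2x2 matrix.
    return (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)

flat = [[5] * 5 for _ in range(5)]
edge = [[0, 0, 10, 10, 10] for _ in range(5)]
corner = [[0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 10, 10, 10],
          [0, 0, 10, 10, 10],
          [0, 0, 10, 10, 10]]
scores = [min_eigenvalue(p) for p in (flat, edge, corner)]
```

Only the corner patch has gradients in both directions, so it is the only one with a non-zero smaller eigenvalue. The flat and edge patches would be rejected as poor features to track.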

In subsequent frames, searching for locations where the corresponding patch has low 
squared difference (7.1) often works well enough. However, if the images are undergo- 
ing brightness change, explicitly compensating for such variations (9.9) or using normalized 
cross-correlation (9.11) may be preferable. If the search range is large, it is also often more 
efficient to use a hierarchical search strategy, which uses matches in lower-resolution images 
to provide better initial guesses and hence speed up the search (Section 9.1.1). Alternatives 
to this strategy involve learning what the appearance of the patch being tracked should be and 
then searching for it in the vicinity of its predicted position (Avidan 2001; Jurie and Dhome 
2002; Williams, Blake, and Cipolla 2003). These topics are all covered in more detail in 
Section 9.1.3. 
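
A minimal sketch of the squared-difference search is given below. It exhaustively tests every offset in a small window around the predicted location. It does not compensate for brightness changes and does not use a hierarchical strategy. The function names and the toy image are hypothetical.

```python
def ssd(patch, image, top, left):
    """Sum of squared differences between patch and the image window at (top, left)."""
    return sum(
        (patch[r][c] - image[top + r][left + c]) ** 2
        for r in range(len(patch))
        for c in range(len(patch[0]))
    )

def track_patch(patch, image, predicted, radius):
    """Search a (2*radius+1)^2 neighborhood around predicted = (top, left).
    Returns (best score, (top, left)) or None if no window fits in the image."""
    ph, pw = len(patch), len(patch[0])
    best = None
    for top in range(predicted[0] - radius, predicted[0] + radius + 1):
        for left in range(predicted[1] - radius, predicted[1] + radius + 1):
            if 0 <= top <= len(image) - ph and 0 <= left <= len(image[0]) - pw:
                score = ssd(patch, image, top, left)
                if best is None or score < best[0]:
                    best = (score, (top, left))
    return best

image = [
    [1, 2, 3, 4, 5],
    [2, 9, 7, 3, 1],
    [3, 6, 8, 2, 0],
    [4, 1, 5, 9, 2],
    [5, 0, 3, 1, 7],
]
patch = [[8, 2], [5, 9]]        # appearance of the feature in the previous frame
score, location = track_patch(patch, image, predicted=(1, 1), radius=1)
```

The search finds the patch one pixel down and to the right of the predicted position with zero residual. In real imagery the minimum is rarely zero, and the residual itself can be used to detect tracking failure.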

If features are being tracked over longer image sequences, their appearance can undergo 
larger changes. You then have to decide whether to continue matching against the originally 
detected patch (feature) or to re-sample each subsequent frame at the matching location. The 
former strategy is prone to failure, as the original patch can undergo appearance changes such 
as foreshortening. The latter runs the risk of the feature drifting from its original location to 
some other location in the image (Shi and Tomasi 1994). (Mathematically, small misregis- 
tration errors compound to create a Markov random walk, which leads to larger drift over 


time.) 
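
The growth of drift can be made explicit under a simple model. Assume, for illustration, that each re-sampling step adds an independent, zero-mean misregistration error $e_t$ with variance $\sigma^2$ (these symbols are introduced here and are not part of the original discussion). The accumulated error after $T$ frames is then

```latex
E_T = \sum_{t=1}^{T} e_t ,
\qquad
\operatorname{Var}(E_T) = T \sigma^2 ,
\qquad
\sqrt{\operatorname{Var}(E_T)} = \sigma \sqrt{T} ,
```

so, for example, a per-frame error of 0.1 pixels grows to an expected drift of about 1 pixel after 100 frames.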


Figure 7.29 Feature tracking using an affine motion model (Shi and Tomasi 1994) © 1994
IEEE: Top row: image patch around the tracked feature location. Bottom row: image patch
after warping back toward the first frame using an affine deformation. Even though the speed
sign gets larger from frame to frame, the affine transformation maintains a good resemblance
between the original and subsequent tracked frames.


A preferable solution is to compare the original patch to later image locations using an 
affine motion model (Section 9.2). Shi and Tomasi (1994) first compare patches in neigh- 
boring frames using a translational model and then use the location estimates produced by 
this step to initialize an affine registration between the patch in the current frame and the 
base frame where a feature was first detected (Figure 7.29). In their system, features are only 
detected infrequently, i.e., only in regions where tracking has failed. In the usual case, an 
area around the current predicted location of the feature is searched with an incremental
registration algorithm (Section 9.1.3). The resulting tracker is often called the
Kanade–Lucas–Tomasi (KLT) tracker.

Since their original work on feature tracking, Shi and Tomasi’s approach has generated 
a plethora of follow-on papers and applications. Beardsley, Torr, and Zisserman (1996) use 
extended feature tracking combined with structure from motion (Chapter 11) to incremen- 
tally build up sparse 3D models from video sequences. Kang, Szeliski, and Shum (1997) 
tie together the corners of adjacent (regularly gridded) patches to provide some additional 
stability to the tracking, at the cost of poorer handling of occlusions. Tommasini, Fusiello 
et al. (1998) provide a better spurious match rejection criterion for the basic Shi and Tomasi 
algorithm, Collins and Liu (2003) provide improved mechanisms for feature selection and 
dealing with larger appearance changes over time, and Shafique and Shah (2005) develop 
algorithms for feature matching (data association) for videos with large numbers of mov- 
ing objects or points. Lepetit and Fua (2005) and Yilmaz, Javed, and Shah (2006) survey 
the larger field of object tracking, which includes not only feature-based techniques but also 


alternative techniques based on contour and region (Section 7.3).

Figure 7.30 Real-time head tracking using fast trained classifiers (Lepetit, Pilet, and Fua
2004) © 2004 IEEE.


A more recent development in feature tracking is the use of learning algorithms to build 
special-purpose recognizers to rapidly search for matching features anywhere in an image 
(Lepetit, Pilet, and Fua 2006; Hinterstoisser, Benhimane et al. 2008; Rogez, Rihan et al. 
2008; Ozuysal, Calonder et al. 2010). By taking the time to train classifiers on sample patches 
and their affine deformations, extremely fast and reliable feature detectors can be constructed, 
which enables much faster motions to be supported (Figure 7.30). Coupling such features to 
deformable models (Pilet, Lepetit, and Fua 2008) or structure-from-motion algorithms (Klein 
and Murray 2008) can result in even higher stability. 

While feature-based tracking is still widely used in real-time applications such as SLAM, 
autonomous navigation, and augmented reality (Section 11.5), a lot of current work on track- 
ing is focused on whole object tracking (Chellappa, Sankaranarayanan et al. 2010; Smeulders, 
Chu et al. 2014), which we study in more detail in Section 9.4.4. 


7.1.6 Application: Performance-driven animation 


One of the most compelling applications of fast feature tracking is performance-driven an- 
imation, i.e., the interactive deformation of a 3D graphics model based on tracking a user’s 
motions (Williams 1990; Litwinowicz and Williams 1994; Lepetit, Pilet, and Fua 2004). 
Buck, Finkelstein et al. (2000) present a system that tracks a user’s facial expressions 
and head motions and then uses them to morph among a series of hand-drawn sketches. An 
animator first extracts the eye and mouth regions of each sketch and draws control lines over 
each image (Figure 7.31a). At run time, a face-tracking system (Toyama 1998) determines 
the current location of these features (Figure 7.31b). The animation system decides which 


Figure 7.31 Performance-driven, hand-drawn animation (Buck, Finkelstein et al. 2000) 
© 2000 ACM: (a) eye and mouth portions of hand-drawn sketch with their overlaid control 
lines; (b) an input video frame with the tracked features overlaid; (c) a different input video 


frame along with its (d) corresponding hand-drawn animation. 


input images to morph based on nearest neighbor feature appearance matching and triangular 
barycentric interpolation. It also computes the global location and orientation of the head 
from the tracked features. The resulting morphed eye and mouth regions are then composited 
back into the overall head model to yield a frame of hand-drawn animation (Figure 7.31d). 

In more recent work, Barnes, Jacobs et al. (2008) watch users animate paper cutouts on a 
desk and then turn the resulting motions and drawings into seamless 2D animations. Feature-
based facial trackers continue to be widely used (Zollhöfer, Thies et al. 2018), both in the
visual effects industry and for real-time smartphone augmented reality effects such as
Facebook’s Spark AR Face Masks.


7.2 Edges and contours 


While interest points are useful for finding image locations that can be accurately matched 
in 2D, edge points are far more plentiful and often carry important semantic associations. 
For example, the boundaries of objects, which also correspond to occlusion events in 3D, are 
usually delineated by visible contours. Other kinds of edges correspond to shadow boundaries 
or crease edges, where surface orientation changes rapidly. Isolated edge points can also be 
grouped into longer curves or contours, as well as straight line segments (Section 7.4). It 
is interesting that even young children have no difficulty in recognizing familiar objects or 


animals from such simple line drawings. 
Figure 7.32 Human boundary detection (Martin, Fowlkes, and Malik 2004) © 2004 IEEE. 
The darkness of the edges corresponds to how many human subjects marked an object bound- 
ary at that location. 


7.2.1 Edge detection 


Given an image, how can we find the salient edges? Consider the color images in Figure 7.32. 
If someone asked you to point out the most “salient” or “strongest” edges or the object bound- 
aries, which ones would you trace? How closely do your perceptions match the edge images 
shown in Figure 7.32? 

Qualitatively, edges occur at boundaries between regions of different color, intensity, or 
texture (Martin, Fowlkes, and Malik 2004; Arbeláez, Maire et al. 2011; Pont-Tuset, Arbeláez 
et al. 2017). Unfortunately, segmenting an image into coherent regions is a difficult task, 
which we address in Section 7.5. Often, it is preferable to detect edges using only purely 
local information. 

Under such conditions, a reasonable approach is to define an edge as a location of rapid 
intensity or color variation. Think of an image as a height field. On such a surface, edges 
occur at locations of steep slopes, or equivalently, in regions of closely packed contour lines 
(on a topographic map). 

A mathematical way to define the slope and direction of a surface is through its gradient, 


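$$\mathbf{J}(\mathbf{x}) = \nabla I(\mathbf{x}) = \left(\frac{\partial I}{\partial x}, \frac{\partial I}{\partial y}\right)(\mathbf{x}). \tag{7.19}$$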


The local gradient vector J points in the direction of steepest ascent in the intensity function. 
Its magnitude is an indication of the slope or strength of the variation, while its orientation 
points in a direction perpendicular to the local contour. 

Unfortunately, taking image derivatives accentuates high frequencies and hence amplifies 


noise, as the proportion of noise to signal is larger at high frequencies. It is therefore prudent 
to smooth the image with a low-pass filter prior to computing the gradient. Because we would 
like the response of our edge detector to be independent of orientation, a circularly symmetric 
smoothing filter is desirable. As we saw in Section 3.2, the Gaussian is the only separable 
circularly symmetric filter, so it is used in most edge detection algorithms. Canny (1986) 
discusses alternative filters and a number of researchers review alternative edge detection 
algorithms and compare their performance (Davis 1975; Nalwa and Binford 1986; Nalwa 
1987; Deriche 1987; Freeman and Adelson 1991; Nalwa 1993; Heath, Sarkar et al. 1998; 
Crane 1997; Ritter and Wilson 2000; Bowyer, Kranenburg, and Dougherty 2001; Arbeláez, 
Maire et al. 2011; Pont-Tuset, Arbeláez et al. 2017). 

Because differentiation is a linear operation, it commutes with other linear filtering oper- 


ations. The gradient of the smoothed image can therefore be written as 
$$\mathbf{J}_\sigma(\mathbf{x}) = \nabla[G_\sigma(\mathbf{x}) * I(\mathbf{x})] = [\nabla G_\sigma](\mathbf{x}) * I(\mathbf{x}), \tag{7.20}$$


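i.e., we can convolve the image with the horizontal and vertical derivatives of the Gaussian 
kernel function, 
$$\nabla G_\sigma(\mathbf{x}) = \left(\frac{\partial G_\sigma}{\partial x}, \frac{\partial G_\sigma}{\partial y}\right)(\mathbf{x}) = [-x \;\; -y]\, \frac{1}{\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right), \tag{7.21}$$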


where the parameter $\sigma$ indicates the width of the Gaussian. This is the same computation 


that is performed by Freeman and Adelson’s (1991) first-order steerable filter, which we have 
already covered in Section 3.2.3. 
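
As a minimal sketch of (7.20–7.21), the smoothed gradient can be computed with two separable passes, one using a sampled 1D Gaussian and the other its derivative. The function names, the $3\sigma$ kernel radius, the unit-sum normalization, and the zero-padded borders are illustrative assumptions rather than choices prescribed by the text.

```python
import numpy as np

def gaussian_derivative_kernels(sigma, radius=None):
    """Sampled 1D Gaussian g and its derivative dg on integer offsets."""
    if radius is None:
        radius = int(np.ceil(3 * sigma))      # assumed truncation radius
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()                              # unit DC gain
    dg = -t / sigma**2 * g                    # d/dt of the Gaussian
    return g, dg

def separable_filter(image, k_rows, k_cols):
    """Convolve columns with k_rows (vertical), then rows with k_cols (horizontal)."""
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k_rows, mode="same"), 0, image)
    return np.apply_along_axis(lambda v: np.convolve(v, k_cols, mode="same"), 1, tmp)

def smoothed_gradient(image, sigma=1.0):
    """Return (J_x, J_y, |J|) for a 2D grayscale array."""
    g, dg = gaussian_derivative_kernels(sigma)
    jx = separable_filter(image, g, dg)       # smooth vertically, differentiate in x
    jy = separable_filter(image, dg, g)       # differentiate in y, smooth horizontally
    return jx, jy, np.hypot(jx, jy)
```

On a horizontal intensity ramp, the interior of `jx` is close to the ramp slope (slightly lower because of kernel truncation) and `jy` vanishes; values within one kernel radius of the border are affected by the zero padding.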

For many applications, however, we wish to thin such a continuous gradient image to 
return isolated edges only, i.e., as single pixels at discrete locations along the edge contours. 
This can be achieved by looking for maxima in the edge strength (gradient magnitude) in a 
direction perpendicular to the edge orientation, i.e., along the gradient direction. 

Finding this maximum corresponds to taking a directional derivative of the strength field 
in the direction of the gradient and then looking for zero crossings. The desired directional 
derivative is equivalent to the dot product between a second gradient operator and the results 
of the first, 


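$$S_\sigma(\mathbf{x}) = \nabla \cdot \mathbf{J}_\sigma(\mathbf{x}) = [\nabla^2 G_\sigma](\mathbf{x}) * I(\mathbf{x}). \tag{7.22}$$
The gradient operator dot product with the gradient is called the Laplacian. The convolution 
kernel 
$$\nabla^2 G_\sigma(\mathbf{x}) = \left(\frac{x^2 + y^2}{\sigma^4} - \frac{2}{\sigma^2}\right) G_\sigma(\mathbf{x}), \tag{7.23}$$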


is therefore called the Laplacian of Gaussian (LoG) kernel (Marr and Hildreth 1980). This 


kernel can be split into two separable parts, 


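$$\nabla^2 G_\sigma(\mathbf{x}) = \left(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\right) G_\sigma(x) G_\sigma(y) + \left(\frac{y^2}{\sigma^4} - \frac{1}{\sigma^2}\right) G_\sigma(y) G_\sigma(x) \tag{7.24}$$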
(Wiejak, Buxton, and Buxton 1985), which allows for a much more efficient implementation 
using separable filtering (Section 3.2.1). 
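
The separability claim can be checked numerically. The sketch below builds the LoG kernel once from the closed form (7.23) and once as a sum of two outer products of 1D kernels, as in (7.24). It assumes an unnormalized Gaussian $G_\sigma(t) = \exp(-t^2/(2\sigma^2))$; the function names and kernel radius are hypothetical.

```python
import numpy as np

def gauss1d(t, sigma):
    """Unnormalized 1D Gaussian (normalization is an assumption of this sketch)."""
    return np.exp(-t**2 / (2 * sigma**2))

def log_kernel_direct(sigma, radius):
    """2D LoG kernel evaluated directly from the closed form (7.23)."""
    t = np.arange(-radius, radius + 1, dtype=float)
    x, y = np.meshgrid(t, t)                  # x varies along columns, y along rows
    g2d = gauss1d(x, sigma) * gauss1d(y, sigma)
    return ((x**2 + y**2) / sigma**4 - 2 / sigma**2) * g2d

def log_kernel_separable(sigma, radius):
    """Same kernel as the sum of two rank-one (separable) terms, as in (7.24)."""
    t = np.arange(-radius, radius + 1, dtype=float)
    g = gauss1d(t, sigma)
    d2g = (t**2 / sigma**4 - 1 / sigma**2) * g    # second derivative of the 1D Gaussian
    # np.outer(a, b)[row, col] = a[row] * b[col]
    return np.outer(g, d2g) + np.outer(d2g, g)
```

Because each term is an outer product, convolving with the full kernel costs four 1D passes instead of one 2D pass, which is where the efficiency gain comes from.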

In practice, it is quite common to replace the Laplacian of Gaussian convolution with a 
difference of Gaussian (DoG) computation, since the kernel shapes are qualitatively similar 
(Figure 3.34). This is especially convenient if a “Laplacian pyramid” (Section 3.5) has already 
been computed.⁸ 

In fact, it is not strictly necessary to take differences between adjacent levels when com- 
puting the edge field. Think about what a zero crossing in a “generalized” difference of 
Gaussians image represents. The finer (smaller kernel) Gaussian is a noise-reduced version 
of the original image. The coarser (larger kernel) Gaussian is an estimate of the average in- 
tensity over a larger region. Thus, whenever the DoG image changes sign, this corresponds 
to the (slightly blurred) image going from relatively darker to relatively lighter, as compared 
to the average intensity in that neighborhood. 
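
To make this interpretation concrete, here is a small 1D sketch that blurs a signal at two scales and subtracts the results. The two scale values, the $4\sigma$ kernel radius, and the replicated borders are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_blur_1d(signal, sigma):
    """Blur a 1D signal with a sampled, unit-sum Gaussian (borders replicated)."""
    radius = int(np.ceil(4 * sigma))          # assumed truncation radius
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()
    padded = np.pad(signal, radius, mode="edge")
    return np.convolve(padded, g, mode="valid")   # same length as the input

def difference_of_gaussians(signal, sigma_fine, sigma_coarse):
    """Fine-scale blur minus coarse-scale blur; its sign is the field S."""
    return gaussian_blur_1d(signal, sigma_fine) - gaussian_blur_1d(signal, sigma_coarse)
```

For an ideal step edge, the result is negative on the dark side and positive on the bright side of the step, and zero far away from it, so the sign change localizes the edge.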

Once we have computed the sign function S(x), we must find its zero crossings and 
convert these into edge elements (edgels). An easy way to detect and represent zero crossings 
is to look for adjacent pixel locations $\mathbf{x}_i$ and $\mathbf{x}_j$ where the sign changes value, i.e., 
$[S(\mathbf{x}_i) > 0] \neq [S(\mathbf{x}_j) > 0]$. 

The sub-pixel location of this crossing can be obtained by computing the “x-intercept” of 


the “line” connecting S(x;) and S(x;), 


$$\mathbf{x}_z = \frac{\mathbf{x}_i S(\mathbf{x}_j) - \mathbf{x}_j S(\mathbf{x}_i)}{S(\mathbf{x}_j) - S(\mathbf{x}_i)}. \tag{7.25}$$

The orientation and strength of such edgels can be obtained by linearly interpolating the 
gradient values computed on the original pixel grid. 
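
A minimal 1D sketch of the sign test and the sub-pixel intercept (7.25) follows. It treats a single scanline of $S$ and only considers horizontally adjacent samples; the function name is hypothetical.

```python
import numpy as np

def zero_crossings_1d(s):
    """Sub-pixel zero crossings between adjacent samples of a 1D signal s."""
    s = np.asarray(s, dtype=float)
    pos = s > 0
    idx = np.nonzero(pos[:-1] != pos[1:])[0]  # [S(x_i) > 0] != [S(x_j) > 0]
    xi, xj = idx.astype(float), idx + 1.0     # the two neighboring sample positions
    si, sj = s[idx], s[idx + 1]
    # x-intercept of the line through (x_i, S(x_i)) and (x_j, S(x_j)), as in (7.25)
    return (xi * sj - xj * si) / (sj - si)
```

The denominator cannot vanish, since the two samples are selected only when exactly one of them is strictly positive. On a 2D image, the same test would be applied to both horizontal and vertical neighbor pairs.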

An alternative edgel representation can be obtained by linking adjacent edgels on the dual 
grid to form edgels that live inside each square formed by four adjacent pixels in the original 
pixel grid.⁹ The advantage of this representation is that the edgels now live on a grid offset by 
half a pixel from the original pixel grid and are thus easier to store and access. As before, the 
orientations and strengths of the edges can be computed by interpolating the gradient field or 
estimating these values from the difference of Gaussian image (see Exercise 7.7). 

In applications where the accuracy of the edge orientation is more important, higher-order 
steerable filters can be used (Freeman and Adelson 1991) (see Section 3.2.3). Such filters are 
more selective for more elongated edges and also have the possibility of better modeling curve 


⁸Recall that Burt and Adelson’s (1983a) “Laplacian pyramid” actually computes differences of Gaussian-filtered 
levels. 

⁹This algorithm is a 2D version of the 3D marching cubes isosurface extraction algorithm (Lorensen and Cline 
1987). 




Figure 7.33 Scale selection for edge detection (Elder and Zucker 1998) © 1998 IEEE: 
(a) original image; (b-c) Canny/Deriche edge detector tuned to the finer (mannequin) and 
coarser (shadow) scales; (d) minimum reliable scale for gradient estimation; (e) minimum 
reliable scale for second derivative estimation, (f) final detected edges. 


intersections because they can represent multiple orientations at the same pixel (Figure 3.16). 
Their disadvantage is that they are more expensive to compute and the directional derivative 
of the edge strength does not have a simple closed form solution.¹⁰ 


Scale selection and blur estimation 


As we mentioned before, the derivative, Laplacian, and Difference of Gaussian filters 
(7.20–7.23) all require the selection of a spatial scale parameter $\sigma$. If we are only interested in 
detecting sharp edges, the width of the filter can be determined from image noise characteris- 
tics (Canny 1986; Elder and Zucker 1998). However, if we want to detect edges that occur at 
different resolutions (Figures 7.33b–c), a scale-space approach that detects and then selects 
edges at different scales may be necessary (Witkin 1983; Lindeberg 1994, 1998a; Nielsen, 
Florack, and Deriche 1997). 


¹⁰In fact, the edge orientation can have a 180° ambiguity for “bar edges”, which makes the computation of zero 


crossings in the derivative more tricky. 
Elder and Zucker (1998) present a principled approach to solving this problem. Given 
a known image noise level, their technique computes, for every pixel, the minimum scale 
at which an edge can be reliably detected (Figure 7.33d). Their approach first computes 
gradients densely over an image by selecting among gradient estimates computed at different 
scales, based on their gradient magnitudes. It then performs a similar estimate of minimum 
scale for directed second derivatives and uses zero crossings of this latter quantity to robustly 
select edges (Figures 7.33e–f). As an optional final step, the blur width of each edge can 
be computed from the distance between extrema in the second derivative response minus the 
width of the Gaussian filter. 


Color edge detection 


While most edge detection techniques have been developed for grayscale images, color im- 
ages can provide additional information. For example, noticeable edges between iso-luminant 
colors (colors that have the same luminance) are useful cues but fail to be detected by grayscale 
edge operators. 

One simple approach is to combine the outputs of grayscale detectors run on each color 
band separately.¹¹ However, some care must be taken. For example, if we simply sum up 
the gradients in each of the color bands, the signed gradients may actually cancel each other! 
(Consider, for example a pure red-to-green edge.) We could also detect edges independently 
in each band and then take the union of these, but this might lead to thickened or doubled 
edges that are hard to link. 
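
The cancellation is easy to reproduce on a single scanline. In the sketch below, the six-pixel row and the root-sum-of-squares magnitude are illustrative assumptions; the latter is just one simple unsigned alternative to summing signed gradients.

```python
import numpy as np

# One scanline crossing a pure red-to-green edge, as (R, G, B) triples in [0, 1].
row = np.array([[1, 0, 0]] * 3 + [[0, 1, 0]] * 3, dtype=float)

per_band = np.diff(row, axis=0)               # signed finite differences in each band
summed_signed = per_band.sum(axis=1)          # naive sum across bands
energy = np.sqrt((per_band**2).sum(axis=1))   # unsigned magnitude across bands
```

At the transition, the red band changes by −1 and the green band by +1, so `summed_signed` is zero everywhere even though the edge is clearly visible, while `energy` has a single nonzero response at the edge.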

A better approach is to compute the oriented energy in each band (Morrone and Burr 
1988; Perona and Malik 1990a), e.g., using a second-order steerable filter (Section 3.2.3) 
(Freeman and Adelson 1991), and then sum up the orientation-weighted energies and find 
their joint best orientation. Unfortunately, the directional derivative of this energy may not 
have a closed form solution (as in the case of signed first-order steerable filters), so a simple 
zero crossing-based strategy cannot be used. However, the technique described by Elder and 
Zucker (1998) can be used to compute these zero crossings numerically instead. 

An alternative approach is to estimate local color statistics in regions around each pixel 
(Ruzon and Tomasi 2001; Martin, Fowlkes, and Malik 2004). This has the advantage that 
more sophisticated techniques (e.g., 3D color histograms) can be used to compare regional 
statistics and that additional measures, such as texture, can also be considered. Figure 7.34 


shows the output of such detectors. 


¹¹Instead of using the raw RGB space, a more perceptually uniform color space such as L*a*b* (see Section 2.3.2) 
can be used. When trying to match human performance (Martin, Fowlkes, and Malik 2004), this makes sense. 


However, in terms of the physics of the underlying image formation and sensing, it may be a questionable strategy. 
Over the years, many other approaches have been developed for detecting color edges, 
dating back to early work by Nevatia (1977). Ruzon and Tomasi (2001) and Gevers, van de 
Weijer, and Stokman (2006) provide good reviews of these approaches, which include ideas 
such as fusing outputs from multiple channels, using multidimensional gradients, and vector- 
based methods. 


Combining edge feature cues 


If the goal of edge detection is to match human boundary detection performance (Bowyer, 
Kranenburg, and Dougherty 2001; Martin, Fowlkes, and Malik 2004; Arbeláez, Maire et 
al. 2011; Pont-Tuset, Arbeláez et al. 2017), as opposed to simply finding stable features for 
matching, even better detectors can be constructed by combining multiple low-level cues such 
as brightness, color, and texture. 

Martin, Fowlkes, and Malik (2004) describe a system that combines brightness, color, 
and texture edges to produce state-of-the-art performance on a database of hand-segmented 
natural color images (Martin, Fowlkes et al. 2001). First, they construct and train separate 
oriented half-disc detectors for measuring significant differences in brightness (luminance), 
color (a* and b* channels, summed responses), and texture (un-normalized filter bank re-
sponses from the work of Malik, Belongie et al. (2001)). Some of the responses are then 
sharpened using a soft non-maximal suppression technique. Finally, the outputs of the three 
detectors are combined using a variety of machine-learning techniques, of which logistic 
regression is found to have the best tradeoff between speed, space, and accuracy. The result-
ing system (see Figure 7.34 for some examples) is shown to outperform previously developed 
techniques. Maire, Arbeláez et al. (2008) improve on these results by combining the detector 
based on local appearance with a spectral (segmentation-based) detector (Belongie and Malik 
1998). In follow-on work, Arbeláez, Maire et al. (2011) build a hierarchical segmentation on 
top of this edge detector using a variant of the watershed algorithm. 


7.2.2 Contour detection 


While isolated edges can be useful for a variety of applications, such as line detection (Sec- 
tion 7.4) and sparse stereo matching (Section 12.2), they become even more useful when 
linked into continuous contours. 

If the edges have been detected using zero crossings of some function, linking them up 
is straightforward, since adjacent edgels share common endpoints. Linking the edgels into 
chains involves picking up an unlinked edgel and following its neighbors in both directions. 


Either a sorted list of edgels (sorted first by x coordinates and then by y coordinates, for 


Figure 7.34 Combined brightness, color, texture boundary detector (Martin, Fowlkes, and 
Malik 2004) © 2004 IEEE. Successive rows show the outputs of the brightness gradient (BG), 
color gradient (CG), texture gradient (TG), and combined (BG+CG+TG) detectors. The final 
row shows human-labeled boundaries derived from a database of hand-segmented images 
(Martin, Fowlkes et al. 2001). 
Figure 7.35 Some coding alternatives for linked contours. (a) A chain code representation 
of a grid-aligned linked edge chain. The code is represented as a series of direction codes, 
e.g., 0107685, which can further be compressed using predictive and run-length coding. 
(b-c) Arc-length parameterization of a contour. Discrete points along the contour (b) are 
first transcribed as (c) (x,y) pairs along the arc length s. This curve can then be regularly 


re-sampled or converted into alternative (e.g., Fourier) representations. 


example) or a 2D array can be used to accelerate the neighbor finding. If edges were not 
detected using zero crossings, finding the continuation of an edgel can be tricky. In this 
case, comparing the orientation (and, optionally, phase) of adjacent edgels can be used for 
disambiguation. Ideas from connected component computation can also sometimes be used 
to make the edge linking process even faster (see Exercise 7.8). 

Once the edgels have been linked into chains, we can apply an optional thresholding 
with hysteresis to remove low-strength contour segments (Canny 1986). The basic idea of 
hysteresis is to set two different thresholds and allow a curve being tracked above the higher 
threshold to dip in strength down to the lower threshold. 
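
One way to sketch this on a single linked chain is to keep every maximal run of edgels that stays at or above the lower threshold, provided the run reaches the higher threshold at least once. The function name and the inclusive comparisons are assumptions of this sketch, not details taken from Canny (1986).

```python
import numpy as np

def hysteresis_1d(strength, low, high):
    """Boolean keep-mask for edgel strengths listed in order along one chain."""
    strength = np.asarray(strength, dtype=float)
    keep = np.zeros(strength.shape, dtype=bool)
    above_low = strength >= low
    i, n = 0, len(strength)
    while i < n:
        if not above_low[i]:
            i += 1
            continue
        j = i
        while j < n and above_low[j]:         # extend the run while it stays above `low`
            j += 1
        if (strength[i:j] >= high).any():     # keep it only if it ever exceeds `high`
            keep[i:j] = True
        i = j
    return keep
```

For strengths `[0.2, 0.6, 0.9, 0.5, 0.1, 0.5, 0.6, 0.05]` with `low=0.4` and `high=0.8`, the run containing 0.9 survives in full, while the weaker run `(0.5, 0.6)` is discarded.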

Linked edgel lists can be encoded more compactly using a variety of alternative repre-
sentations. A chain code encodes a list of connected points lying on an $\mathcal{N}_8$ grid using a 
three-bit code corresponding to the eight compass directions (N, NE, E, SE, S, SW, W, NW) 
between a point and its successor (Figure 7.35a). While this representation is more compact 
than the original edgel list (especially if predictive variable-length coding is used), it is not 
very suitable for further processing. 
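
A chain code encoder and decoder can be sketched in a few lines. The numbering of the directions (0 for N through 7 for NW, in the order listed above) and the image-style coordinates with y increasing downwards are assumptions; other conventions are equally valid.

```python
# Direction codes 0..7 in the order N, NE, E, SE, S, SW, W, NW (assumed numbering),
# written as (dx, dy) steps with y increasing downwards, as in image coordinates.
STEPS = [(0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1)]
CODE = {step: code for code, step in enumerate(STEPS)}

def chain_encode(points):
    """Three-bit direction codes between successive 8-connected grid points."""
    return [CODE[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

def chain_decode(start, codes):
    """Rebuild the point list from a start point and its chain code."""
    pts = [tuple(start)]
    for c in codes:
        dx, dy = STEPS[c]
        x, y = pts[-1]
        pts.append((x + dx, y + dy))
    return pts
```

Only the start point needs full coordinates; every subsequent point costs three bits, which is the source of the compactness.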

A more useful representation is the arc length parameterization of a contour, x(s), where 
s denotes the arc length along a curve. Consider the linked set of edgels shown in Fig- 
ure 7.35b. We start at one point (the dot at (1.0, 0.5) in Figure 7.35c) and plot it at coordinate 
s = 0 (Figure 7.35c). The next point at (2.0,0.5) gets plotted at s = 1, and the next point 
at (2.5, 1.0) gets plotted at s = 1.7071, i.e., we increment s by the length of each edge seg- 


ment. The resulting plot can be resampled on a regular (say, integral) s grid before further 


Figure 7.36 Matching two contours using their arc-length parameterization. If both curves 
are normalized to unit length, $s \in [0, 1]$, and centered around their centroid $\mathbf{x}_0$, they will have 
the same descriptor up to an overall “temporal” shift (due to different starting points for 


s = 0) and a phase (x-y) shift (due to rotation). 


(a) (b) 


Figure 7.37 Curve smoothing with a Gaussian kernel (Lowe 1988) © 1998 IEEE: (a) 


without a shrinkage correction term; (b) with a shrinkage correction term. 


processing. 
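
The arc-length bookkeeping described above can be sketched as follows, using the three example points from the text. The function names and the use of linear interpolation for resampling are assumptions of this sketch.

```python
import numpy as np

def arc_length(points):
    """Cumulative arc length s at each vertex of a polyline, starting at s = 0."""
    pts = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # length of each edge segment
    return np.concatenate([[0.0], np.cumsum(seg)])

def resample_by_arc_length(points, step=1.0):
    """Resample x(s) on a regular s grid by linear interpolation."""
    pts = np.asarray(points, dtype=float)
    s = arc_length(pts)
    s_new = np.arange(0.0, s[-1] + 1e-9, step)
    x = np.interp(s_new, s, pts[:, 0])
    y = np.interp(s_new, s, pts[:, 1])
    return s_new, np.column_stack([x, y])
```

For the points (1.0, 0.5), (2.0, 0.5), and (2.5, 1.0), `arc_length` returns 0, 1, and approximately 1.7071, matching the values quoted in the text.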

The advantage of the arc-length parameterization is that it makes matching and processing 
(e.g., smoothing) operations much easier. Consider the two curves describing similar shapes 
shown in Figure 7.36. To compare the curves, we first subtract the average values $\mathbf{x}_0 = 
\int_s \mathbf{x}(s)$ from each descriptor. Next, we rescale each descriptor so that s goes from 0 to 1 
instead of 0 to S, i.e., we divide x(s) by S. Finally, we take the Fourier transform of each 
normalized descriptor, treating each x = (x,y) value as a complex number. If the original 
curves are the same (up to an unknown scale and rotation), the resulting Fourier transforms 
should differ only by a scale change in magnitude plus a constant complex phase shift, due 
to rotation, and a linear phase shift in the domain, due to different starting points for s (see 


Exercise 7.9). 
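
The normalization and transform steps can be sketched as below. The sketch assumes a closed contour that has already been resampled at corresponding, regularly spaced points, and it compares only Fourier magnitudes, which discards both the constant phase (rotation) and the linear phase (starting point). The function name is hypothetical.

```python
import numpy as np

def fourier_descriptor(points):
    """Fourier transform of a centered, length-normalized closed contour."""
    pts = np.asarray(points, dtype=float)
    z = pts[:, 0] + 1j * pts[:, 1]            # treat (x, y) as a complex number
    z = z - z.mean()                          # subtract the centroid x_0
    closed = np.append(z, z[0])
    z = z / np.abs(np.diff(closed)).sum()     # divide by the total length S
    return np.fft.fft(z)
```

Comparing `np.abs(fourier_descriptor(a))` with `np.abs(fourier_descriptor(b))` then gives a measure that is unchanged when `b` is a translated, rotated, uniformly scaled, and cyclically shifted copy of `a`.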


Arc-length parameterization can also be used to smooth curves to remove digitization 
Figure 7.38 Changing the character of a curve without affecting its sweep (Finkelstein and 
Salesin 1994) © 1994 ACM: higher frequency wavelets can be replaced with exemplars from 
a style library to effect different local appearances. 


noise. However, if we just apply a regular smoothing filter, the curve tends to shrink on 
itself (Figure 7.37a). Lowe (1989) and Taubin (1995) describe techniques that compensate 
for this shrinkage by adding an offset term based on second derivative estimates or a larger 
smoothing kernel (Figure 7.37b). An alternative approach, based on selectively modifying 
different frequencies in a wavelet decomposition, is presented by Finkelstein and Salesin 
(1994). In addition to controlling shrinkage without affecting its “sweep”, wavelets allow the 
“character” of a curve to be interactively modified, as shown in Figure 7.38. 
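
The shrinkage itself is easy to observe numerically. The sketch below applies a plain circular Gaussian filter to a closed curve, with no correction term; it is not an implementation of the Lowe or Taubin compensation schemes. The function name and the choice to measure $\sigma$ in samples are assumptions.

```python
import numpy as np

def smooth_closed_curve(points, sigma):
    """Circular Gaussian smoothing of a closed, regularly sampled curve."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    k = np.arange(n)
    offsets = np.minimum(k, n - k)            # circular distance, in samples
    g = np.exp(-offsets**2 / (2.0 * sigma**2))
    g /= g.sum()                              # unit-sum kernel
    G = np.fft.fft(g)
    # Circular convolution of x(s) and y(s) with g, done in the frequency domain
    return np.real(np.fft.ifft(np.fft.fft(pts, axis=0) * G[:, None], axis=0))
```

Applied to a unit circle sampled at 100 points with `sigma=5.0`, every smoothed point lies at the same radius of roughly 0.95, i.e., the circle has shrunk uniformly towards its center.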

The evolution of curves as they are smoothed and simplified is related to “grassfire” (dis- 
tance) transforms and region skeletons (Section 3.3.3) (Tek and Kimia 2003), and can be used 
to recognize objects based on their contour shape (Sebastian and Kimia 2005). More local de- 
scriptors of curve shape such as shape contexts (Belongie, Malik, and Puzicha 2002) can also 
be used for recognition and are potentially more robust to missing parts due to occlusions. 

The field of contour detection and linking continues to evolve rapidly and now includes 
techniques for global contour grouping, boundary completion, and junction detection (Maire, 
Arbelaez et al. 2008), as well as grouping contours into likely regions (Arbeláez, Maire et 
al. 2011) and wide-baseline correspondence (Meltzer and Soatto 2008). Some additional 
papers that address contour detection include Xiaofeng and Bo (2012), Lim, Zitnick, and 
Dollár (2013), Dollár and Zitnick (2015), Xie and Tu (2015), and Pont-Tuset, Arbeláez et al. 
(2017). 


7.2.3 Application: Edge editing and enhancement 


While edges can serve as components for object recognition or features for matching, they 
can also be used directly for image editing. 


In fact, if the edge magnitude and blur estimate are kept along with each edge, a visually 


Figure 7.39 Image editing in the contour domain (Elder and Goldberg 2001) © 2001 IEEE: 
(a) and (d) original images; (b) and (e) extracted edges (edges to be deleted are marked in 


white); (c) and (f) reconstructed edited images. 


similar image can be reconstructed from this information (Elder 1999). Based on this princi- 
ple, Elder and Goldberg (2001) propose a system for “image editing in the contour domain”. 
Their system allows users to selectively remove edges corresponding to unwanted features 
such as specularities, shadows, or distracting visual elements. After reconstructing the image 
from the remaining edges, the undesirable visual features have been removed (Figure 7.39). 

Another potential application is to enhance perceptually salient edges while simplifying 
the underlying image to produce a cartoon-like or “pen-and-ink” stylized image (DeCarlo and 
Santella 2002). This application is discussed in more detail in Section 10.5.2. 


7.3 Contour tracking 


While lines, vanishing points, and rectangles are commonplace in the human-made world, 
curves corresponding to object boundaries are even more common, especially in the natural 
environment. In this section, we describe some approaches to locating such boundary curves 
in images. 

The first, originally called snakes by its inventors (Kass, Witkin, and Terzopoulos 1988) 


(Section 7.3.1), is an energy-minimizing, two-dimensional spline curve that evolves (moves) 
towards image features such as strong edges. The second, intelligent scissors (Mortensen 
and Barrett 1995) (Section 7.3.1), allows the user to sketch in real time a curve that clings to 
object boundaries. Finally, level set techniques (Section 7.3.2) evolve the curve as the zero- 
set of a characteristic function, which allows them to easily change topology and incorporate 
region-based statistics. 

All three of these are examples of active contours (Blake and Isard 1998; Mortensen 
1999), since these boundary detectors iteratively move towards their final solution under the 
combination of image and optional user-guidance forces. The presentation below is heavily 
shortened from that presented in the first edition of this book (Szeliski 2010, Section 5.1), 


where interested readers can find more details. 


7.3.1 Snakes and scissors 


Snakes are a two-dimensional generalization of the 1D energy-minimizing splines first intro- 


duced in Section 4.2, 
$$E_{\text{int}} = \int \alpha(s)\|\mathbf{f}_s(s)\|^2 + \beta(s)\|\mathbf{f}_{ss}(s)\|^2 \, ds, \tag{7.26}$$


where s is the arc-length along the curve $\mathbf{f}(s) = (x(s), y(s))$ and $\alpha(s)$ and $\beta(s)$ are first- 
and second-order continuity weighting functions analogous to the s(x, y) and c(x, y) terms 
introduced in (4.24–4.25). We can discretize this energy by sampling the initial curve position 
evenly along its length (Figure 7.35c) to obtain 


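$$E_{\text{int}} = \sum_i \alpha(i)\|\mathbf{f}(i+1) - \mathbf{f}(i)\|^2 / h^2 + \beta(i)\|\mathbf{f}(i+1) - 2\mathbf{f}(i) + \mathbf{f}(i-1)\|^2 / h^4, \tag{7.27}$$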


where $h$ is the step size, which can be neglected if we resample the curve along its arc-length 
after each iteration. 
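
A direct transcription of the discrete energy (7.27) for a closed snake might look like the sketch below. It assumes constant weights, a closed curve (so that indices wrap around), and a hypothetical function name.

```python
import numpy as np

def snake_internal_energy(f, alpha=1.0, beta=1.0, h=1.0):
    """Discrete internal energy of a closed snake with constant alpha and beta."""
    f = np.asarray(f, dtype=float)
    d1 = np.roll(f, -1, axis=0) - f                               # f(i+1) - f(i)
    d2 = np.roll(f, -1, axis=0) - 2 * f + np.roll(f, 1, axis=0)   # f(i+1) - 2f(i) + f(i-1)
    stretch = alpha * (d1**2).sum(axis=1) / h**2                  # first-order term
    bend = beta * (d2**2).sum(axis=1) / h**4                      # second-order term
    return float((stretch + bend).sum())
```

For the unit square sampled at its four corners, the first-order term sums to 4 and the second-order term to 8. Scaling the same square by one half reduces the total by a factor of four, so the internal energy alone always favors a smaller curve.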

In addition to this internal spline energy, a snake simultaneously minimizes external 
image-based and constraint-based potentials. The image-based potentials are the sum of sev- 


eral terms 


$$E_{\text{image}} = w_{\text{line}} E_{\text{line}} + w_{\text{edge}} E_{\text{edge}} + w_{\text{term}} E_{\text{term}}, \tag{7.28}$$


where the line term attracts the snake to dark ridges, the edge term attracts it to strong gra- 
dients (edges), and the term term attracts it to line terminations. As the snakes evolve by 
minimizing their energy, they often “wiggle” and “slither”, which accounts for their popular 


name. 
Figure 7.40 Elastic net: The open squares indicate the cities and the closed squares linked 
by straight line segments are the tour points. The blue circles indicate the approximate extent 
of the attraction force of each city, which is reduced over time. Under the Bayesian interpre- 
tation of the elastic net, the blue circles correspond to one standard deviation of the circular 


Gaussian that generates each city from some unknown tour point. 


Because regular snakes have a tendency to shrink, it is usually better to initialize them 
by drawing the snake outside the object of interest to be tracked. Alternatively, an expansion 
ballooning force can be added to the dynamics (Cohen and Cohen 1993), essentially moving 
each point outwards along its normal. It is also possible to replace the energy-minimizing 
variational evolution equations with a deep neural network to significantly improve perfor- 


mance (Peng, Jiang et al. 2020). 


Elastic nets and slippery springs 


An interesting variant on snakes, first proposed by Durbin and Willshaw (1987) and later 
re-formulated in an energy-minimizing framework by Durbin, Szeliski, and Yuille (1989), is 
the elastic net formulation of the Traveling Salesman Problem (TSP). Recall that in a TSP, 
the salesman must visit each city once while minimizing the total distance traversed. A snake 
that is constrained to pass through each city could solve this problem (without any optimality 
guarantees) but it is impossible to tell ahead of time which snake control point should be 


associated with each city. 


Instead of having a fixed constraint between snake nodes and cities, a city is assumed to 
pass near some point along the tour (Figure 7.40). In a probabilistic interpretation, each city 
is generated as a mixture of Gaussians centered at each tour point, 


$$p(\mathbf{d}(j)) = \sum_i p_{ij} \quad \text{with} \quad p_{ij} = e^{-d_{ij}^2/(2\sigma^2)}, \tag{7.29}$$
where $\sigma$ is the standard deviation of the Gaussian and 
$$d_{ij} = \|\mathbf{f}(i) - \mathbf{d}(j)\| \tag{7.30}$$


is the Euclidean distance between a tour point f (i) and a city location d(j). The correspond- 
ing data fitting energy (negative log likelihood) is 


$$E_{\text{slippery}} = -\sum_j \log p(\mathbf{d}(j)) = -\sum_j \log \left[\sum_i e^{-\|\mathbf{f}(i) - \mathbf{d}(j)\|^2/(2\sigma^2)}\right]. \tag{7.31}$$


This energy derives its name from the fact that, unlike a regular spring, which couples a 
given snake point to a given constraint, this alternative energy defines a slippery spring that 
allows the association between constraints (cities) and curve (tour) points to evolve over time 
(Szeliski 1989). Note that this is a soft variant of the popular iterative closest point data 
constraint that is often used in fitting or aligning surfaces to data points or to each other 
(Section 13.2.1) (Besl and McKay 1992; Chen and Medioni 1992; Zhang 1994). 
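
As a rough sketch, the slippery spring energy (7.31) can be evaluated with a log-sum-exp over tour points for each city. The function name is hypothetical, and the max-subtraction is a standard numerical-stability device added here rather than something discussed in the text.

```python
import numpy as np

def slippery_energy(tour, cities, sigma):
    """Negative log likelihood of the cities under Gaussians centered at tour points."""
    tour = np.asarray(tour, dtype=float)      # shape (I, 2), tour points f(i)
    cities = np.asarray(cities, dtype=float)  # shape (J, 2), city locations d(j)
    d2 = ((tour[:, None, :] - cities[None, :, :]) ** 2).sum(axis=2)   # d_ij^2
    a = -d2 / (2.0 * sigma**2)
    m = a.max(axis=0)                         # per-city max, for a stable log-sum-exp
    log_p = m + np.log(np.exp(a - m).sum(axis=0))
    return float(-log_p.sum())
```

Because each city's term sums over all tour points, no city is tied to a particular point; as `sigma` shrinks, the nearest tour point dominates the sum and the energy approaches a closest-point squared distance.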

To compute a good solution to the TSP, the slippery spring data association energy is 
combined with a regular first-order internal smoothness energy (7.27) to define the cost of 
a tour. The tour f(s) is initialized as a small circle around the mean of the city points and 
$\sigma$ is progressively lowered (Figure 7.40). For large $\sigma$ values, the tour tries to stay near the 
centroid of the points but as $\sigma$ decreases each city pulls more and more strongly on its closest 
tour points (Durbin, Szeliski, and Yuille 1989). In the limit as $\sigma \to 0$, each city is guaranteed 
to capture at least one tour point and the tours between subsequent cities become straight lines. 


Splines and shape priors 


While snakes can be very good at capturing the fine and irregular detail in many real-world 
contours, they sometimes exhibit too many degrees of freedom, making it more likely that 
they can get trapped in local minima during their evolution. 

One solution to this problem is to control the snake with fewer degrees of freedom through 
the use of B-spline approximations (Menet, Saint-Marc, and Medioni 1990b,a; Cipolla and 
Blake 1990). The resulting B-snake can be written as 


$$\mathbf{f}(s) = \sum_k B_k(s)\,\mathbf{x}_k. \tag{7.32}$$


If the object being tracked or recognized has large variations in location, scale, or ori- 
entation, these can be modeled as an additional transformation on the control points, e.g., 
$\mathbf{x}'_k = s\mathbf{R}\mathbf{x}_k + \mathbf{t}$ (2.18), which can be estimated at the same time as the values of the control 
points. Alternatively, separate detection and alignment stages can be run to first localize and 


orient the objects of interest (Cootes, Cooper et al. 1995). 
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Figure 7.41 Active Shape Model (ASM): (a) the effect of varying the first four shape pa- 
rameters for a set of faces (Cootes, Taylor et al. 1993) O 1993 IEEE; (b) searching for the 
strongest gradient along the normal to each control point (Cootes, Cooper et al. 1995) O 
1995 Elsevier. 


In a B-snake, because the snake is controlled by fewer degrees of freedom, there is less 
need for the internal smoothness forces used with the original snakes, although these can still 
be derived and implemented using finite element analysis, i.e., taking derivatives and integrals
of the B-spline basis functions (Terzopoulos 1983; Bathe 2007). 

In practice, it is more common to estimate a set of shape priors on the typical distribution
of the control points {x_k} (Cootes, Cooper et al. 1995). One potential way of describing this
distribution would be by the location x̄_k and 2D covariance C_k of each individual point x_k.
These could then be turned into a quadratic penalty (prior energy) on the point location. In 
practice, however, the variation in point locations is usually highly correlated. 

A preferable approach is to estimate the joint covariance of all the points simultaneously. 
First, concatenate all of the point locations {x_k} into a single vector x, e.g., by interleaving
the x and y locations of each point. The distribution of these vectors across all training
examples can be described with a mean x̄ and a covariance

C = \frac{1}{P} \sum_p (x_p - \bar{x})(x_p - \bar{x})^T, \qquad (7.33)

where x_p are the P training examples. Using eigenvalue analysis (Appendix A.1.2), which
is also known as principal component analysis (PCA) (Section 5.2.3 and Appendix B.1), the
covariance matrix can be written as

C = \Phi \, \mathrm{diag}(\lambda_0 \ldots \lambda_{K-1}) \, \Phi^T. \qquad (7.34)

In most cases, the likely appearance of the points can be modeled using only a few
eigenvectors with the largest eigenvalues. The resulting point distribution model (Cootes, Taylor
et al. 1993; Cootes, Cooper et al. 1995) can be written as

x = \bar{x} + \hat{\Phi} b, \qquad (7.35)

where b is an M ≪ K element shape parameter vector and Φ̂ are the first M columns of Φ.
To constrain the shape parameters to reasonable values, we can use a quadratic penalty of the 
form 

\frac{1}{2} b^T \, \mathrm{diag}(\lambda_0 \ldots \lambda_{M-1})^{-1} \, b = \sum_m \frac{b_m^2}{2\lambda_m}. \qquad (7.36)


Alternatively, the range of allowable b_m values can be limited to some range, e.g.,
|b_m| ≤ 3√λ_m (Cootes, Cooper et al. 1995). Alternative approaches for deriving a set of shape
vectors are reviewed by Isard and Blake (1998). Varying the individual shape parameters b_m
over the range −2√λ_m ≤ b_m ≤ 2√λ_m can give a good indication of the expected variation in
appearance, as shown in Figure 7.41a.
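The point distribution model can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the cited authors' implementation: it builds the covariance of Equation (7.33), keeps the M largest eigenvectors, and clamps each shape parameter to ±3√λ_m. The function names and the `limit` parameter are assumptions introduced here.

```python
import numpy as np

def fit_point_distribution_model(shapes, num_modes):
    """shapes: (P, K) array, one interleaved (x0, y0, x1, y1, ...) vector per example."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    cov = centered.T @ centered / shapes.shape[0]    # Equation (7.33)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_modes]    # keep the M largest
    return mean, eigvecs[:, order], eigvals[order]

def project_and_clamp(x, mean, modes, eigvals, limit=3.0):
    """Project a shape onto the model and clamp |b_m| <= limit * sqrt(lambda_m)."""
    b = modes.T @ (x - mean)
    bound = limit * np.sqrt(np.maximum(eigvals, 0.0))
    b = np.clip(b, -bound, bound)
    return mean + modes @ b, b                       # Equation (7.35)
```

Clamping is the hard-limit alternative to the quadratic penalty; either choice keeps a fitted contour close to shapes seen in the training set.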

To align a point distribution model with an image, each control point searches in a di- 
rection normal to the contour to find the most likely corresponding image edge point (Fig- 
ure 7.41b). These individual measurements can be combined with priors on the shape pa- 
rameters (and, if desired, position, scale, and orientation parameters) to estimate a new set 
of parameters. The resulting active shape model (ASM) can be iteratively minimized to fit
images to non-rigidly deforming objects, such as medical images, or body parts, such as
hands (Cootes, Cooper et al. 1995). The ASM can also be combined with a PCA analysis of
the underlying gray-level distribution to create an active appearance model (AAM) (Cootes,
Edwards, and Taylor 2001), which we discussed in more detail in Section 6.2.4.


Dynamic snakes and CONDENSATION 


In many applications of active contours, the object of interest is being tracked from frame 
to frame as it deforms and evolves. In this case, it makes sense to use estimates from the 
previous frame to predict and constrain the new estimates. 

Figure 7.42 Head tracking using CONDENSATION (Isard and Blake 1998) © 1998
Springer: (a) sample set representation of head estimate distribution; (b) multiple
measurements at each control vertex location; (c) multi-hypothesis tracking over time.

One way to do this is to use Kalman filtering, which results in a formulation called Kalman
snakes (Terzopoulos and Szeliski 1992; Blake, Curwen, and Zisserman 1993). The Kalman
filter is based on a linear dynamic model of shape parameter evolution,

x_t = A x_{t-1} + w_t, \qquad (7.37)

where x_t and x_{t−1} are the current and previous state variables, A is the linear transition
matrix, and w is a noise (perturbation) vector, which is often modeled as a Gaussian (Gelb
1974). The matrices A and the noise covariance can be learned ahead of time by observing
typical sequences of the object being tracked (Blake and Isard 1998).
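To make the role of Equation (7.37) concrete, here is a hedged sketch of one Kalman filter cycle in NumPy. The prediction step follows the linear dynamic model directly. The update step assumes a linear measurement model z = Hx + v with Gaussian noise covariance R; H, R, and the function names are assumptions introduced for illustration and are not specified in the text.

```python
import numpy as np

def kalman_predict(x, P, A, Q):
    """Propagate the state mean x and covariance P through x_t = A x_{t-1} + w_t.

    Q is the covariance of the noise w.
    """
    return A @ x, A @ P @ A.T + Q

def kalman_update(x, P, z, H, R):
    """Fuse a measurement z = H x + v, v ~ N(0, R), into the predicted state."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

For a contour tracker, x would hold the shape (and possibly velocity) parameters, and z the edge positions measured along the contour normals in the new frame.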

In many situations, however, such as when tracking in clutter, a better estimate for the 
contour can be obtained if we remove the assumptions that the distributions are Gaussian, 
which is what the Kalman filter requires. In this case, a general multi-modal distribution 
is propagated. To model such multi-modal distributions, Isard and Blake (1998) introduced 
the use of particle filtering to the computer vision community.¹² Particle filtering techniques
represent a probability distribution using a collection of weighted point samples (Andrieu, de 
Freitas et al. 2003; Bishop 2006; Koller and Friedman 2009). 

To update the locations of the samples according to the linear dynamics (deterministic 
drift), the centers of the samples are updated and multiple samples are generated for each 
point. These are then perturbed to account for the stochastic diffusion, i.e., their locations are 
moved by random vectors taken from the distribution of w.¹³ Finally, the weights of these
samples are multiplied by the measurement probability density, i.e., we take each sample 
and measure its likelihood given the current (new) measurements. Because the point samples 
represent and propagate conditional estimates of the multi-modal density, Isard and Blake 
(1998) dubbed their algorithm CONditional DENSity propagATION or CONDENSATION. 
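The drift, diffusion, and reweighting steps described above can be sketched as a single function. This is a simplified illustration, not Isard and Blake's implementation: it assumes weighted resampling to generate multiple samples from high-weight points, isotropic Gaussian noise for w, and a user-supplied `likelihood` callable. All names are hypothetical.

```python
import numpy as np

def condensation_step(samples, weights, A, noise_std, likelihood, rng):
    """One resample / drift / diffuse / reweight iteration.

    samples:    (N, D) array of state samples.
    weights:    (N,) non-negative weights summing to 1.
    likelihood: callable mapping an (N, D) array to (N,) measurement densities.
    rng:        a numpy.random.Generator.
    """
    n = len(samples)
    # Resample in proportion to the weights, so likely samples are duplicated.
    idx = rng.choice(n, size=n, p=weights)
    # Deterministic drift under the linear dynamics.
    predicted = samples[idx] @ A.T
    # Stochastic diffusion: perturb each sample with a draw of w (assumed Gaussian).
    predicted = predicted + rng.normal(0.0, noise_std, size=predicted.shape)
    # After resampling all weights are equal, so the new weight is the likelihood.
    new_weights = likelihood(predicted)
    new_weights = new_weights / new_weights.sum()
    return predicted, new_weights
```

Nothing in the step requires A to be linear or the noise to be Gaussian; replacing those two lines with any other sampler leaves the rest unchanged.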

12 Alternatives to modeling multi-modal distributions include mixtures of Gaussians (Section 5.2.2) and multiple
hypothesis tracking (Bar-Shalom and Fortmann 1988; Cham and Rehg 1999).
13 Note that because of the structure of these steps, non-linear dynamics and non-Gaussian noise can be used.

Figure 7.42a shows what a factored sample of a head tracker might look like, drawing
a red B-spline contour for each of (a subset of) the particles being tracked. Figure 7.42b
shows why the measurement density itself is often multi-modal: the locations of the edges
perpendicular to the spline curve can have multiple local maxima due to background clutter.
Finally, Figure 7.42c shows the temporal evolution of the conditional density (x coordinate
of the head and shoulder tracker centroid) as it tracks several people over time.

Figure 7.43 Intelligent scissors: (a) as the mouse traces the white path, the scissors follow
the orange path along the object boundary (the green curves show intermediate positions)
(Mortensen and Barrett 1995) © 1995 ACM; (b) regular scissors can sometimes jump to a
stronger (incorrect) boundary; (c) after training to the previous segment, similar edge profiles
are preferred (Mortensen and Barrett 1998) © 1995 Elsevier.


Scissors 


Active contours allow a user to roughly specify a boundary of interest and have the system 
evolve the contour towards a more accurate location as well as track it over time. The results 
of this curve evolution, however, may be unpredictable and may require additional user-based 
hints to achieve the desired result. 

An alternative approach is to have the system optimize the contour in real time as the 
user is drawing (Mortensen 1999). The intelligent scissors system developed by Mortensen 
and Barrett (1995) does just that. As the user draws a rough outline (the white curve in 
Figure 7.43a), the system computes and draws a better curve that clings to high-contrast 
edges (the orange curve). 

To compute the optimal curve path (live-wire), the image is first pre-processed to associate 
low costs with edges (links between neighboring horizontal, vertical, and diagonal, i.e., N_8
neighbors) that are likely to be boundary elements. Their system uses a combination of zero- 
crossing, gradient magnitudes, and gradient orientations to compute these costs. 

Next, as the user traces a rough curve, the system continuously recomputes the lowest-cost
path between the starting seed point and the current mouse location using Dijkstra’s
algorithm, a breadth-first dynamic programming algorithm that terminates at the current target
location.
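The live-wire search can be sketched with a standard priority-queue version of Dijkstra's algorithm on the 8-connected pixel grid. This is a simplified illustration under stated assumptions: costs are stored per pixel rather than per link, diagonal steps are scaled by √2, and the function name `live_wire` is hypothetical. The actual system derives its costs from zero-crossings, gradient magnitudes, and gradient orientations.

```python
import heapq

def live_wire(cost, seed, target):
    """Lowest-cost 8-connected path from seed to target on a 2D cost grid.

    cost[r][c] is the non-negative cost of stepping onto pixel (r, c); low
    values should mark likely boundary pixels. Returns a list of (r, c) tuples.
    """
    rows, cols = len(cost), len(cost[0])
    dist = {seed: 0.0}
    prev = {}
    heap = [(0.0, seed)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == target:
            break                      # terminate once the target is settled
        if d > dist.get((r, c), float("inf")):
            continue                   # stale queue entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                    # Diagonal steps are longer, so scale their cost.
                    step = cost[nr][nc] * (2 ** 0.5 if dr and dc else 1.0)
                    nd = d + step
                    if nd < dist.get((nr, nc), float("inf")):
                        dist[(nr, nc)] = nd
                        prev[(nr, nc)] = (r, c)
                        heapq.heappush(heap, (nd, (nr, nc)))
    # Walk the predecessor links back from the target to the seed.
    path = [target]
    while path[-1] != seed:
        path.append(prev[path[-1]])
    return path[::-1]
```

In an interactive tool, the distances computed from a fixed seed can be cached and reused as the mouse moves, so only the final back-tracking step depends on the current target.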

In order to keep the system from jumping around unpredictably, the system will “freeze” 
the curve to date (reset the seed point) after a period of inactivity. To prevent the live wire 
from jumping onto adjacent higher-contrast contours, the system also “learns” the intensity 
profile under the current optimized curve, and uses this to preferentially keep the wire moving 
along the same (or a similar looking) boundary (Figure 7.43b-c). 

Several extensions have been proposed to the basic algorithm, which works remarkably 
well even in its original form. Mortensen and Barrett (1999) use tobogganing, which is a 
simple form of watershed region segmentation, to pre-segment the image into regions whose 
boundaries become candidates for optimized curve paths. The resulting region boundaries 
are turned into a much smaller graph, where nodes are located wherever three or four regions 
meet. The Dijkstra algorithm is then run on this reduced graph, resulting in much faster (and 
often more stable) performance. Another extension to intelligent scissors is to use a proba- 
bilistic framework that takes into account the current trajectory of the boundary, resulting in 
a system called JetStream (Pérez, Blake, and Gangnet 2001). 

Instead of re-computing an optimal curve at each time instant, a simpler system can be 
developed by simply “snapping” the current mouse position to the nearest likely boundary 
point (Gleicher 1995). Applications of these boundary extraction techniques to image cutting
and pasting are presented in Section 10.4.


7.3.2 Level Sets 


A limitation of active contours based on parametric curves of the form f(s), e.g., snakes, B- 
snakes, and CONDENSATION, is that it is challenging to change the topology of the curve 
as it evolves (McInerney and Terzopoulos 1999, 2000). Furthermore, if the shape changes 
dramatically, curve reparameterization may also be required. 

An alternative representation for such closed contours is to use a level set, where the
zero-crossing(s) of a characteristic (or signed distance (Section 3.3.3)) function define the curve.
Level sets evolve to fit and track objects of interest by modifying the underlying embedding
function (another name for this 2D function) φ(x, y) instead of the curve f(s) (Malladi,
Sethian, and Vemuri 1995; Sethian 1999; Sapiro 2001; Osher and Paragios 2003). To reduce
the amount of computation required, only a small strip (frontier) around the locations of
the current zero-crossing needs to be updated at each step, which results in what are called fast
marching methods (Sethian 1999).


Figure 7.44 Level set evolution for a geodesic active contour. The embedding function
φ is updated based on the curvature of the underlying surface modulated by the edge/speed
function g(I), as well as the gradient of g(I), thereby attracting it to strong edges.

An example of an evolution equation is the geodesic active contour proposed by Caselles,
Kimmel, and Sapiro (1997) and Yezzi, Kichenassamy et al. (1997),

\frac{d\phi}{dt} = |\nabla\phi| \, \mathrm{div}\left( g(I) \frac{\nabla\phi}{|\nabla\phi|} \right)
                 = g(I) \, |\nabla\phi| \, \mathrm{div}\left( \frac{\nabla\phi}{|\nabla\phi|} \right) + \nabla g(I) \cdot \nabla\phi, \qquad (7.38)

where g(I) is a generalized version of the snake edge potential. To get an intuitive sense
of the curve’s behavior, assume that the embedding function φ is a signed distance function
away from the curve (Figure 7.44), in which case |∇φ| = 1. The first term in Equation (7.38)
moves the curve in the direction of its curvature, i.e., it acts to straighten the curve, under
the influence of the modulation function g(I). The second term moves the curve down the
gradient of g(I), encouraging the curve to migrate towards minima of g(I).
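The equality between the two forms of Equation (7.38) follows from the product rule for the divergence of a scalar times a vector field. The short derivation below is a worked step added for clarity and uses only the symbols already defined:

```latex
\begin{align*}
|\nabla\phi| \,\mathrm{div}\!\left( g(I)\,\frac{\nabla\phi}{|\nabla\phi|} \right)
  &= |\nabla\phi| \left[ g(I)\,\mathrm{div}\!\left( \frac{\nabla\phi}{|\nabla\phi|} \right)
     + \nabla g(I) \cdot \frac{\nabla\phi}{|\nabla\phi|} \right] \\
  &= g(I)\,|\nabla\phi| \,\mathrm{div}\!\left( \frac{\nabla\phi}{|\nabla\phi|} \right)
     + \nabla g(I) \cdot \nabla\phi .
\end{align*}
```

The quantity div(∇φ/|∇φ|) is the curvature of the level curves of φ, so when φ is a signed distance function (|∇φ| = 1) the first term reduces to g(I) times the curvature.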

While this level-set formulation can readily change topology, it is still susceptible to lo- 
cal minima, since it is based on local measurements such as image gradients. An alternative 
approach is to re-cast the problem in a segmentation framework, where the energy measures 
the consistency of the image statistics (e.g., color, texture, motion) inside and outside the seg- 
mented regions (Cremers, Rousson, and Deriche 2007; Rousson and Paragios 2008; Houhou, 
Thiran, and Bresson 2008). These approaches build on earlier energy-based segmentation 
frameworks introduced by Leclerc (1989), Mumford and Shah (1989), and Chan and Vese 
(2001), which are discussed in more detail in Section 4.3.2. 

For more information on level sets and their applications, please see the collection of
papers edited by Osher and Paragios (2003) as well as the series of Workshops on Variational
and Level Set Methods in Computer Vision (Paragios, Faugeras et al. 2005) and Special
Issues on Scale Space and Variational Methods in Computer Vision (Paragios and Sgallari
2009).

Figure 7.45 Keyframe-based rotoscoping (Agarwala, Hertzmann et al. 2004) © 2004
ACM: (a) original frames; (b) rotoscoped contours; (c) re-colored blouse; (d) rotoscoped
hand-drawn animation.


7.3.3 Application: Contour tracking and rotoscoping 


Active contours can be used in a wide variety of object-tracking applications (Blake and 
Isard 1998; Yilmaz, Javed, and Shah 2006). For example, they can be used to track facial 
features for performance-driven animation (Terzopoulos and Waters 1990; Lee, Terzopoulos, 
and Waters 1995; Parke and Waters 1996; Bregler, Covell, and Slaney 1997). They can also 
be used to track heads and people, as shown in Figure 7.42, as well as moving vehicles 
(Paragios and Deriche 2000). Additional applications include medical image segmentation, 
where contours can be tracked from slice to slice in computed tomography (Cootes and Taylor 
2001), or over time, as in ultrasound scans. 

An interesting application that is closer to computer animation and visual effects is
rotoscoping, which uses the tracked contours to deform a set of hand-drawn animations (or to
modify or replace the original video frames).¹⁴ Agarwala, Hertzmann et al. (2004) present a
system based on tracking hand-drawn B-spline contours drawn at selected keyframes, using
a combination of geometric and appearance-based criteria (Figure 7.45). They also provide
an excellent review of previous rotoscoping and image-based, contour-tracking systems.

Additional applications of rotoscoping (object contour detection and segmentation), such
as cutting and pasting objects from one photograph into another, are presented in Section 10.4.

14 The term comes from a device (a rotoscope) that projected frames of a live-action film underneath an acetate so
that artists could draw animations directly over the actors’ shapes.


7.4 Lines and vanishing points 


While edges and general curves are suitable for describing the contours of natural objects, 
the human-made world is full of straight lines. Detecting and matching these lines can be 
useful in a variety of applications, including architectural modeling, pose estimation in urban 
environments, and the analysis of printed document layouts. 

In this section, we present some techniques for extracting piecewise linear descriptions 
from the curves computed in the previous section. We begin with some algorithms for approx- 
imating a curve as a piecewise-linear polyline. We then describe the Hough transform, which 
can be used to group edgels into line segments even across gaps and occlusions. Finally, we 
describe how 3D lines with common vanishing points can be grouped together. These
vanishing points can be used to calibrate a camera and to determine its orientation relative to a
rectahedral scene, as described in Section 11.1.1.


7.4.1 Successive approximation 


As we saw in Section 7.2.2, describing a curve as a series of 2D locations x_i = x(s_i) provides
a general representation suitable for matching and further processing. In many applications, 
however, it is preferable to approximate such a curve with a simpler representation, e.g., as a 
piecewise-linear polyline or as a B-spline curve (Farin 2002). 

Many techniques have been developed over the years to perform this approximation, 
which is also known as line simplification. One of the oldest, and simplest, is the one proposed 
by Ramer (1972) and Douglas and Peucker (1973), who recursively subdivide the curve at 
the point furthest away from the line joining the two endpoints (or the current coarse polyline 
approximation). Hershberger and Snoeyink (1992) provide a more efficient implementation 
and also cite some of the other related work in this area. 
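The recursive subdivision of Ramer (1972) and Douglas and Peucker (1973) can be sketched as follows. This is a straightforward textbook version, not the more efficient implementation of Hershberger and Snoeyink (1992). It measures distance to the infinite line through the endpoints, and the function names and the `tolerance` parameter are chosen here for illustration.

```python
import math

def point_line_distance(p, a, b):
    """Distance from point p to the line through a and b (or to a if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length

def simplify(points, tolerance):
    """Recursively split at the point furthest from the chord joining the endpoints."""
    if len(points) < 3:
        return list(points)
    dists = [point_line_distance(p, points[0], points[-1]) for p in points[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[worst - 1] <= tolerance:
        return [points[0], points[-1]]       # the chord is a good enough fit
    left = simplify(points[:worst + 1], tolerance)
    right = simplify(points[worst:], tolerance)
    return left[:-1] + right                 # do not duplicate the split point
```

The tolerance directly controls the trade-off between the number of polyline vertices and the maximum deviation from the original curve.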

Once the line simplification has been computed, it can be used to approximate the orig- 
inal curve. If a smoother representation or visualization is desired, either approximating or 
interpolating splines or curves can be used (Sections 3.5.1 and 7.3.1) (Szeliski and Ito 1986; 
Bartels, Beatty, and Barsky 1987; Farin 2002). 


7.4.2 Hough transforms 


While curve approximation with polylines can often lead to successful line extraction, lines
in the real world are sometimes broken up into disconnected components or made up of many
collinear line segments. In many cases, it is desirable to group such collinear segments into
extended lines. At a further processing stage (described in Section 7.4.3), we can then group
such lines into collections with common vanishing points.

Figure 7.46 Original Hough transform: (a) each point votes for a complete family of
potential lines r_i(θ) = x_i cos θ + y_i sin θ; (b) each pencil of lines sweeps out a sinusoid in
(r, θ); their intersection provides the desired line equation.

Figure 7.47 Oriented Hough transform: (a) an edgel re-parameterized in polar (r, θ)
coordinates, with n̂_i = (cos θ_i, sin θ_i) and r_i = n̂_i · x_i; (b) (r, θ) accumulator array, showing
the votes for the three edgels marked in red, green, and blue.

The Hough transform, named after its original inventor (Hough 1962), is a well-known
technique for having edges “vote” for plausible line locations (Duda and Hart 1972; Ballard
1981; Illingworth and Kittler 1988). In its original formulation (Figure 7.46), each edge point
votes for all possible lines passing through it, and lines corresponding to high accumulator or
bin values are examined for potential line fits.¹⁵ Unless the points on a line are truly punctate,
a better approach is to use the local orientation information at each edgel to vote for a single
accumulator cell (Figure 7.47), as described below. A hybrid strategy, where each edgel votes
for a number of possible orientation or location pairs centered around the estimated orientation,
may be desirable in some cases.

15 The Hough transform can also be generalized to look for other geometric features, such as circles (Ballard
1981).

Figure 7.48 2D line equation expressed in terms of the normal n̂ and distance to the origin
d.

Before we can vote for line hypotheses, we must first choose a suitable representation.
Figure 7.48 (copied from Figure 2.2a) shows the normal-distance (n̂, d) parameterization for
a line. Since lines are made up of edge segments, we adopt the convention that the line normal
n̂ points in the same direction (i.e., has the same sign) as the image gradient J(x) = ∇I(x)
(7.19). To obtain a minimal two-parameter representation for lines, we convert the normal
vector into an angle

\theta = \tan^{-1}(\hat{n}_y / \hat{n}_x), \qquad (7.39)

as shown in Figure 7.48. The range of possible (θ, d) values is [−180°, 180°] × [−√2, √2],
assuming that we are using normalized pixel coordinates (2.61) that lie in [−1, 1]. The number
of bins to use along each axis depends on the accuracy of the position and orientation estimate 
available at each edgel and the expected line density, and is best set experimentally with some 
test runs on sample imagery. 

There are a lot of details in getting the Hough transform to work well, including using 
edge segment lengths or strengths during the voting process, keeping a list of constituent 
edgels in the accumulator array for easier post-processing, and optionally combining edges 
of different “polarity” into the same line segments. These are best worked out by writing an 
implementation and testing it out on sample data. 
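A bare-bones version of the oriented Hough transform, with one vote per edgel, might look like the sketch below. It omits the refinements listed above (vote weighting by edge strength, lists of constituent edgels, polarity merging). The bin counts, the function name, and the sign convention for the offset (taken as r_i = n̂_i · x_i from Figure 7.47) are assumptions made for illustration.

```python
import math

def oriented_hough(edgels, n_theta=180, n_d=64, d_max=math.sqrt(2.0)):
    """Accumulate one vote per edgel in a (theta, d) array.

    edgels: iterable of (x, y, gx, gy), with (x, y) in normalized [-1, 1]
    coordinates and (gx, gy) the image gradient at that point.
    """
    acc = [[0] * n_d for _ in range(n_theta)]
    for x, y, gx, gy in edgels:
        norm = math.hypot(gx, gy)
        if norm == 0.0:
            continue                          # no orientation available
        nx, ny = gx / norm, gy / norm         # line normal follows the gradient
        theta = math.atan2(ny, nx)            # Equation (7.39), in [-pi, pi]
        d = nx * x + ny * y                   # offset along the normal (assumed sign)
        ti = min(int((theta + math.pi) / (2 * math.pi) * n_theta), n_theta - 1)
        di = min(int((d + d_max) / (2 * d_max) * n_d), n_d - 1)
        acc[ti][di] += 1
    return acc
```

Peaks in the returned array correspond to candidate lines; in practice the bin counts n_theta and n_d would be tuned on sample imagery, as the text suggests.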

Figure 7.49 Cube map representation for line equations and vanishing points: (a) a cube
map surrounding the unit sphere; (b) projecting the half-cube onto three subspaces
(Tuytelaars, Van Gool, and Proesmans 1997) © 1997 IEEE.

An alternative to the 2D polar (θ, d) representation for lines is to use the full 3D m =
(n̂, d) line equation, projected onto the unit sphere. While the sphere can be parameterized
using spherical coordinates (2.8),

\hat{m} = (\cos\theta \cos\phi, \; \sin\theta \cos\phi, \; \sin\phi), \qquad (7.40)

this does not uniformly sample the sphere and still requires the use of trigonometry.

An alternative representation can be obtained by using a cube map, i.e., projecting m onto
the face of a unit cube (Figure 7.49a). To compute the cube map coordinate of a 3D vector
m, first find the largest (absolute value) component of m, i.e., m = ±max(|n_x|, |n_y|, |d|),
and use this to select one of the six cube faces. Divide the remaining two coordinates by m
and use these as indices into the cube face. While this avoids the use of trigonometry, it does
require some decision logic.
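The cube map lookup described above amounts to a few comparisons and two divisions. The sketch below returns the selected face and the two in-face coordinates; the face encoding as an (axis, sign) pair and the function name are choices made here for illustration.

```python
def cube_map_coords(m):
    """Map a 3D vector m = (nx, ny, d) to (face, u, v) on the unit cube.

    face is (axis, sign): axis indexes the largest-magnitude component and
    sign is +1 or -1; u and v are the other two components divided by it,
    so both lie in [-1, 1].
    """
    axis = max(range(3), key=lambda i: abs(m[i]))
    largest = m[axis]
    if largest == 0:
        raise ValueError("zero vector has no cube-map coordinate")
    sign = 1 if largest > 0 else -1
    u, v = (m[i] / largest for i in range(3) if i != axis)
    return (axis, sign), u, v
```

To use the result as an accumulator index, u and v would still need to be quantized into bins on the selected face.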

One advantage of using the cube map, first pointed out by Tuytelaars, Van Gool, and 
Proesmans (1997), is that all of the lines passing through a point correspond to line segments 
on the cube faces, which is useful if the original (full voting) variant of the Hough transform 
is being used. In their work, they represent the line equation as ax + b + y = 0, which
does not treat the x and y axes symmetrically. Note that if we restrict d > 0 by ignoring the 
polarity of the edge orientation (gradient sign), we can use a half-cube instead, which can be 
represented using only three cube faces, as shown in Figure 7.49b (Tuytelaars, Van Gool, and 
Proesmans 1997). 


RANSAC-based line detection. Another alternative to the Hough transform is the RAN- 
dom SAmple Consensus (RANSAC) algorithm described in more detail in Section 8.1.4. In 
brief, RANSAC randomly chooses pairs of edgels to form a line hypothesis and then tests 
how many other edgels fall onto this line. (If the edge orientations are accurate enough, a 
single edgel can produce this hypothesis.) Lines with sufficiently large numbers of inliers 
(matching edgels) are then selected as the desired line segments. 

An advantage of RANSAC is that no accumulator array is needed, so the algorithm can
be more space efficient and potentially less sensitive to the choice of bin size. The disadvantage
is that many more hypotheses may need to be generated and tested than those obtained by
finding peaks in the accumulator array.
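A minimal RANSAC line detector, following the description above, is sketched below. It uses pairs of points to form hypotheses and a fixed distance tolerance to count inliers. The iteration count, tolerance, seed, and function name are illustrative assumptions, and the sketch returns only the single best line rather than all lines with enough inliers.

```python
import math
import random

def ransac_line(points, n_iters=200, inlier_tol=0.02, seed=0):
    """Return ((nx, ny, r), inliers) for the best line n . x = r that was found."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)   # minimal two-point hypothesis
        nx, ny = y1 - y2, x2 - x1                    # normal to the chord
        norm = math.hypot(nx, ny)
        if norm == 0.0:
            continue                                 # coincident points
        nx, ny = nx / norm, ny / norm
        r = nx * x1 + ny * y1
        inliers = [p for p in points
                   if abs(nx * p[0] + ny * p[1] - r) <= inlier_tol]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = (nx, ny, r), inliers
    return best_line, best_inliers
```

To extract several lines, one common pattern is to remove the inliers of the best line and repeat; a final least squares fit to the inliers usually improves the estimate.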


Bottom-up grouping. Yet another approach to line segment detection is to iteratively group
edgels with similar orientations into oriented rectangular line-support regions (Burns, Hanson,
and Riseman 1986). The validity of such regions can then be determined using a statistical
analysis, as described in the LSD paper by Grompone von Gioi, Jakubowicz et al. (2008).
The resulting algorithm is quite fast, does a good job of distinguishing line segments from
texture, and is widely used in practice because of its performance and open source availability.
Recently, deep neural network algorithms have been developed to simultaneously extract
line segments and their junctions (Huang, Wang et al. 2018; Zhang, Li et al. 2019; Huang,
Qin et al. 2020; Lin, Pintea, and van Gemert 2020).

Figure 7.50 Real-world vanishing points: (a) architecture (Sinha, Steedly et al. 2008), (b)
furniture (Mičušík, Wildenauer, and Košecká 2008) © 2008 IEEE, and (c) calibration patterns
(Zhang 2000).

In general, there is no clear consensus on which line estimation technique performs best. 
It is therefore a good idea to think carefully about the problem at hand and to implement 
several approaches (successive approximation, Hough, and RANSAC) to determine the one 
that works best for your application. 


7.4.3 Vanishing points 


In many scenes, structurally important lines have the same vanishing point because they are 
parallel in 3D. Examples of such lines are horizontal and vertical building edges, zebra cross- 
ings, railway tracks, the edges of furniture such as tables and dressers, and of course, the 
ubiquitous calibration pattern (Figure 7.50). Finding the vanishing points common to such 
line sets can help refine their position in the image and, in certain cases, help determine the 
intrinsic and extrinsic orientation of the camera (Section 11.1.1). 

Figure 7.51 Rectangle detection: (a) indoor corridor and (b) building exterior with
grouped facades (Košecká and Zhang 2005) © 2005 Elsevier; (c) grammar-based recognition
(Han and Zhu 2005) © 2005 IEEE; (d–f) rectangle matching using a plane sweep algorithm
(Mičušík, Wildenauer, and Košecká 2008) © 2008 IEEE.

Over the years, a large number of techniques have been developed for finding vanishing
points (Quan and Mohr 1989; Collins and Weiss 1990; Brillaut-O’Mahoney 1991; McLean
and Kotturi 1995; Becker and Bove 1995; Shufelt 1999; Tuytelaars, Van Gool, and Proesmans
1997; Schaffalitzky and Zisserman 2000; Antone and Teller 2002; Rother 2002; Košecká and
Zhang 2005; Denis, Elder, and Estrada 2008; Pflugfelder 2008; Tardif 2009; Bazin, Seo et al.
2012; Antunes and Barreto 2013; Kluger, Ackermann et al. 2017; Zhou, Qi et al. 2019a); see
some of the more recent papers for additional references and alternative approaches.
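As a concrete illustration of the underlying geometry, the sketch below estimates a single vanishing point from lines that are assumed to have already been assigned to it. It uses plain (non-robust) least squares on homogeneous line equations, which is a common textbook formulation rather than any specific method cited above. The function names are hypothetical.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points, computed as a cross product."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def least_squares_vanishing_point(lines):
    """Homogeneous point v minimizing sum_i (l_i . v)^2 subject to |v| = 1.

    lines: sequence of homogeneous line vectors (a, b, c) with (a, b) != (0, 0).
    """
    L = np.asarray(lines, dtype=float)
    L = L / np.linalg.norm(L[:, :2], axis=1, keepdims=True)   # unit line normals
    _, _, vt = np.linalg.svd(L)
    return vt[-1]      # right singular vector of the smallest singular value
```

Keeping the result in homogeneous form means that vanishing points at or near infinity (nearly parallel image lines) are represented without special cases; divide by the last component only when it is safely non-zero.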


In the first edition of this book (Szeliski 2010, Section 4.3.3), I presented a simple Hough 
technique based on having line pairs vote for potential vanishing point locations, followed 
by a robust least squares fitting stage. While my technique proceeds in two discrete stages, 
better results may be obtained by alternating between assigning lines to vanishing points and 
refitting the vanishing point locations (Antone and Teller 2002; Košecká and Zhang 2005;
Pflugfelder 2008). The results of detecting individual vanishing points can also be made 
more robust by simultaneously searching for pairs or triplets of mutually orthogonal vanishing 
points (Shufelt 1999; Antone and Teller 2002; Rother 2002; Sinha, Steedly et al. 2008; Li, 
Kim et al. 2020). Some results of such vanishing point detection algorithms can be seen in 
Figure 7.50. It is also possible to simultaneously detect line segments and their junctions
using a neural network (Zhang, Li et al. 2019) and to then use these to construct complete 3D
wireframe models (Zhou, Qi, and Ma 2019; Zhou, Qi et al. 2019b).


Rectangle detection


Once sets of mutually orthogonal vanishing points have been detected, it now becomes
possible to search for 3D rectangular structures in the image (Figure 7.51). A variety of
techniques have been developed to find such rectangles, primarily focused on architectural scenes
(Košecká and Zhang 2005; Han and Zhu 2005; Shaw and Barnes 2006; Mičušík, Wildenauer,
and Košecká 2008; Schindler, Krishnamurthy et al. 2008).

After detecting orthogonal vanishing directions, Košecká and Zhang (2005) refine the
fitted line equations, search for corners near line intersections, and then verify rectangle
hypotheses by rectifying the corresponding patches and looking for a preponderance of
horizontal and vertical edges (Figures 7.51a–b). In follow-on work, Mičušík, Wildenauer, and
Košecká (2008) use a Markov random field (MRF) to disambiguate between potentially
overlapping rectangle hypotheses. They also use a plane sweep algorithm to match rectangles
between different views (Figures 7.51d–f).

A different approach is proposed by Han and Zhu (2005), who use a grammar of potential 
rectangle shapes and nesting structures (between rectangles and vanishing points) to infer 
the most likely assignment of line segments to rectangles (Figure 7.51c). The idea of using 
regular, repetitive structures as part of the modeling process is now being called holistic 3D 
reconstruction (Zhou, Furukawa, and Ma 2019; Zhou, Furukawa et al. 2020; Pintore, Mura et
al. 2020) and will be discussed in more detail in Section 13.6.1 on modeling 3D architecture.


7.5 Segmentation 


Image segmentation is the task of finding groups of pixels that “go together”. In statistics and 
machine learning, this problem is known as cluster analysis or more simply clustering and is 
a widely studied area with hundreds of different algorithms (Jain and Dubes 1988; Kaufman 
and Rousseeuw 1990; Jain, Duin, and Mao 2000; Jain, Topchy et al. 2004; Xu and Wunsch 
2005). We’ve already discussed general vector-space clustering algorithms in Section 5.2.1. 
The main difference between clustering and segmentation is that the former usually ignores 
pixel layout and neighborhoods, while the latter relies heavily on spatial cues and constraints. 

In computer vision, image segmentation is one of the oldest and most widely studied prob- 
lems (Brice and Fennema 1970; Pavlidis 1977; Riseman and Arbib 1977; Ohlander, Price, 
and Reddy 1978; Rosenfeld and Davis 1979; Haralick and Shapiro 1985). Early techniques 
often used region splitting or merging (Brice and Fennema 1970; Horowitz and Pavlidis 1976; 
Ohlander, Price, and Reddy 1978; Pavlidis and Liow 1990), which correspond to divisive and 
agglomerative algorithms (Jain, Topchy et al. 2004; Xu and Wunsch 2005), which we
introduced in Section 5.2.1. More recent algorithms typically optimize some global criterion, such
as intra-region consistency and inter-region boundary lengths or dissimilarity (Leclerc 1989;
Mumford and Shah 1989; Shi and Malik 2000; Comaniciu and Meer 2002; Felzenszwalb and
Huttenlocher 2004; Cremers, Rousson, and Deriche 2007; Pont-Tuset, Arbeláez et al. 2017).

We have already seen examples of image segmentation using image morphology (Sec- 
tion 3.3.3), Markov random fields (Section 4.3), active contours (Section 7.3), and level sets 
(Section 7.3.2). In the recognition chapter (Section 6.4), we studied semantic segmentation, 
whose goal is to break the image up into semantically labeled regions such as sky, grass, 
and individual people and animals. In this section, we review some additional techniques for 
bottom-up general (non-semantic) image segmentation. These include algorithms based on 
region splitting and merging, graph-based segmentation, and probabilistic aggregation (Sec- 
tion 7.5.1), mean shift mode finding (Section 7.5.2), and normalized cuts splitting based on 
pixel similarity metrics (Section 7.5.3). Since many of these algorithms are no longer widely
used, their descriptions have been considerably shortened; the longer versions can be found in
the first edition of this book (Szeliski 2010, Chapter 5).

Since the literature on image segmentation is so vast, a good way to get a handle on 
some of the better performing algorithms is to look at experimental comparisons on human- 
labeled databases (Arbeláez, Maire et al. 2011; Pont-Tuset, Arbeláez et al. 2017). The best 
known of these is the Berkeley Segmentation Dataset and Benchmark (Martin, Fowlkes et 
al. 2001), which consists of 1,000 images from a Corel image dataset that were hand-labeled
by 30 human subjects. For this dataset, Unnikrishnan, Pantofaru, and Hebert (2007) propose new
metrics for comparing segmentation algorithms, while Estrada and Jepson (2009) compare
four well-known segmentation algorithms. A newer database of foreground and background
segmentations, used by Alpert, Galun et al. (2007), is also available. 

As mentioned in Section 3.3.3, the simplest possible technique for segmenting a grayscale 
image is to select a threshold and then compute connected components. Unfortunately, a 
single threshold is rarely sufficient for the whole image because of lighting and intra-object
statistical variations.
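The threshold-then-label procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `threshold_components`, the 4-connected neighborhood, and the toy image are all hypothetical choices made for brevity, not something prescribed in Section 3.3.3.

```python
from collections import deque


def threshold_components(image, threshold):
    """Label 4-connected groups of pixels brighter than `threshold`.

    `image` is a list of rows of gray levels. Returns (labels, count), where
    labels[y][x] is 0 for background and 1..count for foreground components.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for y0 in range(rows):
        for x0 in range(cols):
            if image[y0][x0] <= threshold or labels[y0][x0]:
                continue
            count += 1                      # start a new component here
            labels[y0][x0] = count
            queue = deque([(y0, x0)])
            while queue:                    # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical 4x5 test image containing two bright and two dim objects.
image = [
    [9, 9, 0, 0, 3],
    [9, 0, 0, 3, 3],
    [0, 0, 0, 0, 0],
    [4, 4, 0, 9, 9],
]
_, bright_only = threshold_components(image, threshold=5)
_, all_objects = threshold_components(image, threshold=2)
```

In this toy image, a threshold of 5 finds only the two bright objects, while a threshold of 2 finds all four, so the result depends strongly on how well one global threshold suits every object.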


Region splitting (divisive clustering). Splitting the image into successively finer regions is 
one of the oldest techniques in computer vision. Ohlander, Price, and Reddy (1978) present 
such a technique, which first computes a histogram for the whole image and then finds a 
threshold that best separates the large peaks in the histogram. This process is repeated until 
regions are either fairly uniform or below a certain size. More recent splitting algorithms 
often optimize some metric of intra-region similarity and inter-region dissimilarity. These are
covered in Sections 4.3.2 and 7.5.3.

Region merging (agglomerative clustering). Region merging techniques also date back to
the beginnings of computer vision. Brice and Fennema (1970) use a dual grid for representing 
boundaries between pixels and merge regions based on their relative boundary lengths and the 
strength of the visible edges at these boundaries. 


A very simple version of pixel-based merging combines adjacent regions whose average 
color difference is below a threshold or whose regions are too small. Segmenting the image 
into such superpixels (Mori, Ren et al. 2004), which are not semantically meaningful, can be a 
useful pre-processing stage to make higher-level algorithms such as stereo matching (Zitnick, 
Kang et al. 2004; Taguchi, Wilburn, and Zitnick 2008), optical flow (Zitnick, Jojic, and Kang 
2005; Brox, Bregler, and Malik 2009), and recognition (Mori, Ren et al. 2004; Mori 2005; Gu, 
Lim et al. 2009; Lim, Arbeláez et al. 2009) both faster and more robust. It is also possible 
to combine both splitting and merging by starting with a medium-grain segmentation (in a 
quadtree representation) and then allowing both merging and splitting operations (Horowitz 
and Pavlidis 1976; Pavlidis and Liow 1990). 


Watershed. A technique related to thresholding, since it operates on a grayscale image, 
is watershed computation (Vincent and Soille 1991). This technique segments an image 
into several catchment basins, which are the regions of an image (interpreted as a height 
field or landscape) where rain would flow into the same lake. An efficient way to compute 
such regions is to start flooding the landscape at all of the local minima and to label ridges 
wherever differently evolving components meet. The whole algorithm can be implemented
using a priority queue of pixels and breadth-first search (Vincent and Soille 1991).¹⁶
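The flooding idea can be sketched with a priority queue as follows. This is a simplified variant written for illustration, not the algorithm of Vincent and Soille (1991) itself: it assumes a 4-connected neighborhood, and instead of marking ridge pixels separately it assigns each of them to whichever basin reaches it first. The function name `watershed` and the test field are hypothetical.

```python
import heapq


def watershed(height):
    """Flood a 2D height field from its minima; returns (labels, count)."""
    rows, cols = len(height), len(height[0])
    labels = [[0] * cols for _ in range(rows)]
    heap = []      # entries: (height, insertion order, y, x, proposed label)
    pushed = 0
    count = 0

    def push_neighbors(y, x, label):
        nonlocal pushed
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols and labels[ny][nx] == 0:
                heapq.heappush(heap, (height[ny][nx], pushed, ny, nx, label))
                pushed += 1

    order = sorted((height[y][x], y, x)
                   for y in range(rows) for x in range(cols))
    for level, y, x in order:
        # Let the existing basins rise to the current water level first.
        while heap and heap[0][0] <= level:
            _, _, py, px, label = heapq.heappop(heap)
            if labels[py][px] == 0:
                labels[py][px] = label
                push_neighbors(py, px, label)
        # A pixel that is still dry at its own level starts a new basin.
        if labels[y][x] == 0:
            count += 1
            labels[y][x] = count
            push_neighbors(y, x, count)
    return labels, count


# Hypothetical height field: two valleys separated by a ridge in column 2.
field = [
    [1, 2, 5, 2, 1],
    [0, 2, 6, 2, 0],
    [1, 2, 5, 2, 1],
]
labels, count = watershed(field)
```

The insertion counter stored in each heap entry makes pixels of equal height leave the queue in first-in, first-out order, which is where the breadth-first behavior on plateaus comes from.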


Since images rarely have dark regions separated by lighter ridges, watershed segmentation 
is usually applied to a smoothed version of the gradient magnitude image, which also makes it 
usable with color images. As an alternative, the maximum oriented energy in a steerable filter 
(3.28-3.29) (Freeman and Adelson 1991) can be used as the basis of the oriented watershed 
transform developed by Arbeláez, Maire et al. (2011). Such techniques end up finding smooth 
regions separated by visible (higher gradient) boundaries. Since such boundaries are what 
active contours usually follow, active contour algorithms (Mortensen and Barrett 1999; Li, 
Sun et al. 2004) often precompute such a segmentation using either the watershed or the 
related tobogganing technique (Section 7.3.1). 


¹⁶ A related algorithm can be used to compute maximally stable extremal regions (MSERs) efficiently
(Section 7.1.1) (Nistér and Stewénius 2008).


486 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


Figure 7.52 Graph-based merging segmentation (Felzenszwalb and Huttenlocher 2004)
© 2004 Springer: (a) input grayscale image that is successfully segmented into three regions
even though the variation inside the smaller rectangle is larger than the variation across
the middle edge; (b) input grayscale image; (c) resulting segmentation using an $N_8$ pixel
neighborhood.


7.5.1 Graph-based segmentation 


While many merging algorithms simply apply a fixed rule that groups pixels and regions
together, Felzenszwalb and Huttenlocher (2004) present a merging algorithm that uses relative
dissimilarities between regions to determine which ones should be merged; the resulting
algorithm provably optimizes a global grouping metric. They start with a pixel-to-pixel
dissimilarity measure $w(e)$ that measures, for example, intensity differences between
$N_8$ neighbors. Alternatively, they can use the joint feature space distances introduced by
Comaniciu and Meer (2002), which we discuss in Sections 7.5.2 and 7.5.3. Figure 7.52 shows
two examples of images segmented using their technique.
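A minimal sketch of this style of merging is given below. The merge test used here (an edge joins two components only if its weight does not exceed either component's largest internal edge weight plus a tolerance `k / size`) is how we recall the criterion from the original paper; it is not spelled out in this section, so treat it, the scale parameter `k`, and the helper names `grid_edges` and `segment_graph` as assumptions.

```python
def grid_edges(image):
    """Build (weight, a, b) edges between N8 neighbors of a grayscale image."""
    rows, cols = len(image), len(image[0])
    edges = []
    for y in range(rows):
        for x in range(cols):
            # Four of the eight directions suffice to visit each pair once.
            for dy, dx in ((0, 1), (1, -1), (1, 0), (1, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < rows and 0 <= nx < cols:
                    weight = abs(image[y][x] - image[ny][nx])
                    edges.append((weight, y * cols + x, ny * cols + nx))
    return edges


def segment_graph(num_nodes, edges, k):
    """Merge components in order of increasing edge weight (union-find)."""
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes   # largest edge weight used inside a component

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for weight, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        limit = min(internal[ra] + k / size[ra], internal[rb] + k / size[rb])
        if weight <= limit:
            parent[rb] = ra
            size[ra] += size[rb]
            internal[ra] = weight   # edges arrive sorted, so this is the max
    return [find(a) for a in range(num_nodes)]


# Hypothetical 3x4 image: a dark left half and a bright right half.
image = [
    [10, 11, 50, 52],
    [10, 12, 51, 50],
    [11, 10, 52, 51],
]
labels = segment_graph(12, grid_edges(image), k=10)
```

Because the tolerance `k / size` shrinks as a component grows, small regions merge readily while large regions require a boundary that is weak relative to their own internal variation.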


Probabilistic aggregation 


Alpert, Galun et al. (2007) develop a probabilistic merging algorithm based on two cues, 
namely gray-level similarity and texture similarity. The gray-level similarity between regions 
$R_i$ and $R_j$ is based on the minimal external difference from other neighboring regions, which
is compared to the average intensity difference to compute the likelihoods $p_{ij}$ that two regions
should be merged. Merging proceeds in a hierarchical fashion inspired by algebraic multigrid 
techniques (Brandt 1986; Briggs, Henson, and McCormick 2000) and previously used by 
Alpert, Galun et al. (2007) in their segmentation by weighted aggregation (SWA) algorithm 
(Sharon, Galun et al. 2006). Figure 7.56 shows the segmentations produced by this algorithm 
compared to other popular segmentation algorithms.


7.5.2 Mean shift


Mean-shift and mode finding techniques, such as k-means and mixtures of Gaussians, model 
the feature vectors associated with each pixel (e.g., color and position) as samples from an 
unknown probability density function and then try to find clusters (modes) in this distribution. 

Consider the color image shown in Figure 7.53a. How would you segment this image 
based on color alone? Figure 7.53b shows the distribution of pixels in L*u*v* space, which 
is equivalent to what a vision algorithm that ignores spatial location would see. To make the
visualization simpler, let us only consider the L*u* coordinates, as shown in Figure 7.53c. 
How many obvious (elongated) clusters do you see? How would you go about finding these 
clusters? 

The k-means and mixtures of Gaussians techniques we studied in Section 5.2.2 use a 
parametric model of the density function to answer this question, i.e., they assume the den- 
sity is the superposition of a small number of simpler distributions (e.g., Gaussians) whose 
locations (centers) and shape (covariance) can be estimated. Mean shift, on the other hand, 
smoothes the distribution and finds its peaks as well as the regions of feature space that cor- 
respond to each peak. Since a complete density is being modeled, this approach is called 
non-parametric (Bishop 2006). 

The key to mean shift is a technique for efficiently finding peaks in this high-dimensional 
data distribution without ever computing the complete function explicitly (Fukunaga and 
Hostetler 1975; Cheng 1995; Comaniciu and Meer 2002). Consider once again the data points 
shown in Figure 7.53c, which can be thought of as having been drawn from some probability 
density function. If we could compute this density function, as visualized in Figure 7.53e, we 
could find its major peaks (modes) and identify regions of the input space that climb to the 
same peak as being part of the same region. This is the inverse of the watershed algorithm 
described in Section 7.5, which moves downhill to find basins of attraction.

The first question, then, is how to estimate the density function given a sparse set of 
samples. One of the simplest approaches is to just smooth the data, e.g., by convolving it with 
a fixed kernel of width h, which, as we saw in Section 4.1.1, is the Parzen window approach 
to density estimation (Duda, Hart, and Stork 2001, Section 4.3; Bishop 2006, Section 2.5.1). 
Once we have computed f(x), as shown in Figure 7.53e, we can find its local maxima using 
gradient ascent or some other optimization technique. 

Figure 7.53 Mean-shift image segmentation (Comaniciu and Meer 2002) © 2002 IEEE:
(a) input color image; (b) pixels plotted in L*u*v* space; (c) L*u* space distribution; (d)
clustered results after 159 mean-shift procedures; (e) corresponding trajectories with peaks
marked as red dots. [Plot axis label: normalized density.]

The problem with this “brute force” approach is that, for higher dimensions, it becomes
computationally prohibitive to evaluate f(x) over the complete search space. Instead, mean
shift uses a variant of what is known in the optimization literature as multiple restart gradient
descent. Starting at some guess for a local maximum, $y_k$, which can be a random input data
point $x_i$, mean shift computes the gradient of the density estimate f(x) at $y_k$ and takes an
uphill step in that direction. Details on how this can be done efficiently can be found in papers
on mean shift (Comaniciu and Meer 2002; Paris and Durand 2007) as well as the first edition
of this book (Szeliski 2010, Section 5.3.2).
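To make the uphill step concrete, here is a small sketch of the procedure. It assumes a Gaussian kernel of width `bandwidth`, for which the uphill step reduces to moving the current estimate to the kernel-weighted mean of the samples; this is the standard mean-shift update as we understand it from the papers cited above, not a formula given in this section. The function names, the mode-merging tolerance of half a bandwidth, and the sample points are hypothetical.

```python
import math


def mean_shift_mode(start, points, bandwidth, tol=1e-6, max_iter=500):
    """Climb from `start` to a nearby mode of the kernel density of `points`."""
    y = list(start)
    for _ in range(max_iter):
        weights = [math.exp(-math.dist(y, p) ** 2 / (2 * bandwidth ** 2))
                   for p in points]
        total = sum(weights)
        # The new estimate is the Gaussian-weighted mean of all samples.
        shifted = [sum(w * p[d] for w, p in zip(weights, points)) / total
                   for d in range(len(y))]
        moved = math.dist(shifted, y)
        y = shifted
        if moved < tol:
            break
    return y


def mean_shift_clusters(points, bandwidth):
    """Start one climb per sample and group samples that reach the same mode."""
    modes, labels = [], []
    for p in points:
        mode = mean_shift_mode(p, points, bandwidth)
        for index, known in enumerate(modes):
            if math.dist(mode, known) < bandwidth / 2:
                labels.append(index)
                break
        else:
            modes.append(mode)
            labels.append(len(modes) - 1)
    return modes, labels


# Hypothetical 2D feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
modes, labels = mean_shift_clusters(points, bandwidth=1.0)
```

Note that the density f(x) is never evaluated on a grid; only the weighted means along each trajectory are computed.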

The color-based segmentation shown in Figure 7.53 only looks at pixel colors when deter- 
mining the best clustering. It may therefore cluster together small isolated pixels that happen 
to have the same color, which may not correspond to a semantically meaningful segmentation 
of the image. Better results can usually be obtained by clustering in the joint domain of color 
and location. In this approach, the spatial coordinates of the image $x_s = (x, y)$, which are
called the spatial domain, are concatenated with the color values $x_r$, which are known as the
range domain, and mean-shift clustering is applied in this five-dimensional space $x_j$. Since
location and color may have different scales, the kernels are adjusted separately, just as in the 
bilateral filter kernel (3.34-3.37) discussed in Section 3.3.2. The difference between mean 
shift and bilateral filtering, however, is that in mean shift, the spatial coordinates of each pixel 
are adjusted along with its color values, so that the pixel migrates more quickly towards other 
pixels with similar colors, and can therefore later be used for clustering and segmentation. 

Mean shift has been applied to a number of different problems in computer vision, in- 
cluding face tracking, 2D shape extraction, and texture segmentation (Comaniciu and Meer 
2002), stereo matching (Wei and Quan 2004), non-photorealistic rendering (Section 10.5.2) 
(DeCarlo and Santella 2002), and video editing (Section 10.4.5) (Wang, Bhat et al. 2005). 
Paris and Durand (2007) provide a nice review of such applications, as well as techniques for
more efficiently solving the mean-shift equations and producing hierarchical segmentations.


7.5.3 Normalized cuts 


While bottom-up merging techniques aggregate regions into coherent wholes and mean-shift 
techniques try to find clusters of similar pixels using mode finding, the normalized cuts 
technique introduced by Shi and Malik (2000) examines the affinities (similarities) between 
nearby pixels and tries to separate groups that are connected by weak affinities. 

Consider the simple graph shown in Figure 7.54a. The pixels in group A are all strongly 
connected with high affinities, shown as thick red lines, as are the pixels in group B. The 
connections between these two groups, shown as thinner blue lines, are much weaker. A 
normalized cut between the two groups, shown as a dashed line, separates them into two 
clusters. 

The cut between two groups A and B is defined as the sum of all the weights being cut, 
where the weights between two pixels (or regions) $i$ and $j$ measure their similarity. Using
a minimum cut as a segmentation criterion, however, does not result in reasonable clusters,
since the smallest cuts usually involve isolating a single pixel.

(a) [sample weighted graph]

(b)
      |      A      |      B      |     sum
  A   | assoc(A, A) | cut(A, B)   | assoc(A, V)
  B   | cut(B, A)   | assoc(B, B) | assoc(B, V)
  sum | assoc(A, V) | assoc(B, V) |

Figure 7.54 Sample weighted graph and its normalized cut: (a) a small sample graph
and its smallest normalized cut; (b) tabular form of the associations and cuts for this graph.
The assoc and cut entries are computed as area sums of the associated weight matrix W.
Normalizing the table entries by the row or column sums produces normalized associations
and cuts Nassoc and Ncut.


A better measure of segmentation is the normalized cut, which is defined as

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}, \tag{7.41}$$


where $\mathrm{assoc}(A, A) = \sum_{i \in A, j \in A} w_{ij}$ is the association (sum of all the weights) within a
cluster and assoc(A, V) = assoc(A, A) + cut(A, B) is the sum of all the weights associated
with nodes in A. Figure 7.54b shows how the cuts and associations can be thought of as area
sums in the weight matrix $W = [w_{ij}]$, where the entries of the matrix have been arranged
so that the nodes in A come first and the nodes in B come second. Dividing each of these 
areas by the corresponding row sum (the rightmost column of Figure 7.54b) results in the 
normalized cut and association values. These normalized values better reflect the fitness of a 
particular segmentation, since they look for collections of edges that are weak relative to all 
of the edges both inside and emanating from a particular region. 
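The following sketch evaluates both criteria on a small graph, using the definitions in (7.41) directly. The five-node weight matrix and the function name `cut_and_ncut` are hypothetical; the graph is a chain in which the last node is only weakly attached, so that the minimum cut and the normalized cut disagree.

```python
def cut_and_ncut(W, in_A):
    """Return (cut(A, B), Ncut(A, B)) for a symmetric weight matrix W.

    in_A[i] is True when node i belongs to group A; all other nodes form B.
    """
    n = len(W)
    cut = sum(W[i][j] for i in range(n) for j in range(n)
              if in_A[i] and not in_A[j])
    # assoc(A, V) and assoc(B, V) are sums over the rows belonging to A and B.
    assoc_AV = sum(sum(W[i]) for i in range(n) if in_A[i])
    assoc_BV = sum(sum(W[i]) for i in range(n) if not in_A[i])
    return cut, cut / assoc_AV + cut / assoc_BV


# Hypothetical chain 0-1-2-3-4 with edge weights 5, 2, 5, and 1.5.
W = [
    [0.0, 5.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0, 0.0, 1.5],
    [0.0, 0.0, 0.0, 1.5, 0.0],
]
balanced_cut, balanced_ncut = cut_and_ncut(W, [True, True, False, False, False])
isolated_cut, isolated_ncut = cut_and_ncut(W, [False, False, False, False, True])
```

Here the plain cut prefers isolating node 4 (1.5 versus 2), whereas the normalized cut strongly prefers the balanced split (about 0.3 versus about 1.06), which is the behavior the normalization is meant to produce.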

Unfortunately, computing the optimal normalized cut is NP-complete. Instead, Shi and 
Malik (2000) suggest computing a real-valued assignment of nodes to groups, using a general- 
ized eigenvalue analysis of the normalized affinity matrix (Weiss 1999), as described in more 
detail in the normalized cuts paper and in the first edition of this book (Szeliski 2010, Section 5.4).
Because these eigenvectors can be interpreted as the large modes of vibration in a spring-mass system, normalized
cuts is an example of a spectral method for image segmentation. After the real-valued eigen- 
vector is computed, the variables corresponding to positive and negative eigenvector values 
are associated with the two cut components. This process can be further repeated to hierar- 
chically subdivide an image, as shown in Figure 7.55. 
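A rough sketch of the two-way split is shown below. It assumes the generalized eigenproblem $(D - W) y = \lambda D y$, where $D$ is the diagonal matrix of row sums of $W$, and uses the eigenvector with the second smallest eigenvalue, which is our reading of the formulation in Shi and Malik (2000) rather than something stated in this section. To stay dependency-free, it finds that eigenvector by power iteration with deflation instead of a sparse eigensolver; the function name, the iteration count, and the starting vector are hypothetical, and every node is assumed to have a positive row sum.

```python
import math


def normalized_cut_split(W, iterations=200):
    """Split nodes by the sign of the second generalized eigenvector."""
    n = len(W)
    degree = [sum(row) for row in W]
    root = [math.sqrt(d) for d in degree]
    # The normalized affinity D^-1/2 W D^-1/2 has leading eigenvector D^1/2 1.
    scale = math.sqrt(sum(degree))
    first = [r / scale for r in root]

    def orthonormalize(z):
        dot = sum(a * b for a, b in zip(z, first))
        z = [a - dot * b for a, b in zip(z, first)]
        length = math.sqrt(sum(a * a for a in z))
        return [a / length for a in z]

    z = orthonormalize([math.sin(i + 1.0) for i in range(n)])
    for _ in range(iterations):
        # Apply (I + D^-1/2 W D^-1/2) / 2, whose eigenvalues lie in [0, 1],
        # while projecting out the known leading eigenvector.
        z = orthonormalize([
            0.5 * (z[i] + sum(W[i][j] * z[j] / (root[i] * root[j])
                              for j in range(n)))
            for i in range(n)
        ])
    y = [z[i] / root[i] for i in range(n)]   # back to the generalized problem
    return [value > 0 for value in y]


# Hypothetical chain 0-1-2-3-4 with edge weights 5, 2, 5, and 1.5.
W = [
    [0.0, 5.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 5.0, 0.0],
    [0.0, 0.0, 5.0, 0.0, 1.5],
    [0.0, 0.0, 0.0, 1.5, 0.0],
]
split = normalized_cut_split(W)
```

The sign of an eigenvector is arbitrary, so only the grouping matters: nodes 0 and 1 end up on one side and nodes 2, 3, and 4 on the other. For real images, a sparse eigensolver would replace the dense loops used here.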


Figure 7.55 Normalized cuts segmentation (Shi and Malik 2000) © 2000 IEEE: The input
image and the components returned by the normalized cuts algorithm.

The original algorithm proposed by Shi and Malik (2000) used spatial position and image
feature differences to compute the pixel-wise affinities. In subsequent work, Malik, Belongie
et al. (2001) look for intervening contours between pixels $i$ and $j$ to define intervening contour
weights and then multiply these weights with a texton-based texture similarity metric. They
then use an initial over-segmentation based purely on local pixel-wise features to re-estimate
intervening contours and texture statistics in a region-based manner. Figure 7.56 shows the
results of running this improved algorithm on a number of test images.

Because it requires the solution of large sparse eigenvalue problems, normalized cuts can 
be quite slow. Sharon, Galun et al. (2006) present a way to accelerate the computation of 
the normalized cuts using an approach inspired by algebraic multigrid (Brandt 1986; Briggs, 
Henson, and McCormick 2000). 

An example of the segmentation produced by weighted aggregation (SWA) is shown in 
Figure 7.56, along with the most recent probabilistic bottom-up merging algorithm by Alpert, 
Galun et al. (2007). In more recent work, Pont-Tuset, Arbeláez et al. (2017) speed up normalized
cuts and extend it to multiple scales to obtain state-of-the-art results on both the
Berkeley Segmentation Dataset and (at the time) object proposals on the VOC and
COCO datasets.


Figure 7.56 Comparative segmentation results (Alpert, Galun et al. 2007) © 2007 IEEE.
“Our method” refers to the probabilistic bottom-up merging algorithm developed by Alpert
et al. [Panel titles: Original image, Our method, SWA V1, Normalized cuts, Mean-shift.]


7.6 Additional reading


One of the seminal papers on feature detection, description, and matching is by Lowe (2004).
Comprehensive surveys and evaluations of such techniques have been made by Schmid,
Mohr, and Bauckhage (2000), Mikolajczyk and Schmid (2005), Mikolajczyk, Tuytelaars et
al. (2005), and Tuytelaars and Mikolajczyk (2008), while Shi and Tomasi (1994) and Triggs
(2004) also provide nice reviews.

In the area of feature detectors (Mikolajczyk, Tuytelaars et al. 2005), in addition to such 
classic approaches as Förstner–Harris (Förstner 1986; Harris and Stephens 1988) and difference
of Gaussians (Lindeberg 1993, 1998b; Lowe 2004), maximally stable extremal regions
(MSERs) are widely used for applications that require affine invariance (Matas, Chum et al. 
2004; Nistér and Stewénius 2008). More recent interest point detectors are discussed by Xiao 
and Shah (2003), Koethe (2003), Carneiro and Jepson (2005), Kenney, Zuliani, and Manju- 
nath (2005), Bay, Ess et al. (2008), Platel, Balmachnova et al. (2006), and Rosten, Porter, 
and Drummond (2010), as are techniques based on line matching (Zoghlami, Faugeras, and 
Deriche 1997; Bartoli, Coquerelle, and Sturm 2004) and region detection (Kadir, Zisserman, 
and Brady 2004; Matas, Chum et al. 2004; Tuytelaars and Van Gool 2004; Corso and Hager 
2005). Three recent papers with nice reviews of DNN-based feature detectors are Balntas, 
Lenc et al. (2020), Barroso-Laguna, Riba et al. (2019), and Tian, Balntas et al. (2020). 

A variety of local feature descriptors (and matching heuristics) are surveyed and com- 
pared by Mikolajczyk and Schmid (2005). More recent publications in this area include 
those by van de Weijer and Schmid (2006), Abdel-Hakim and Farag (2006), Winder and 
Brown (2007), and Hua, Brown, and Winder (2007) and the recent evaluations by Balntas, 
Lenc et al. (2020) and Jin, Mishkin et al. (2021). Techniques for efficiently matching features 
include k-d trees (Beis and Lowe 1999; Lowe 2004; Muja and Lowe 2009), pyramid matching 
kernels (Grauman and Darrell 2005), metric (vocabulary) trees (Nistér and Stewénius 2006),
a variety of multi-dimensional hashing techniques (Shakhnarovich, Viola, and Darrell 2003;
Torralba, Weiss, and Fergus 2008; Weiss, Torralba, and Fergus 2008; Kulis and Grauman 
2009; Raginsky and Lazebnik 2009), and product quantization (Jégou, Douze, and Schmid 
2010; Johnson, Douze, and Jégou 2021). A good review of large-scale systems for instance 
retrieval is Zheng, Yang, and Tian (2018). 


The classic reference on feature detection and tracking is Shi and Tomasi (1994). More 
recent work in this field has focused on learning better matching functions for specific features 
(Avidan 2001; Jurie and Dhome 2002; Williams, Blake, and Cipolla 2003; Lepetit and Fua 
2005; Lepetit, Pilet, and Fua 2006; Hinterstoisser, Benhimane et al. 2008; Rogez, Rihan et 
al. 2008; Ozuysal, Calonder et al. 2010). 


A highly cited and widely used edge detector is the one developed by Canny (1986). Al- 
ternative edge detectors as well as experimental comparisons can be found in publications 
by Nalwa and Binford (1986), Nalwa (1987), Deriche (1987), Freeman and Adelson (1991), 
Nalwa (1993), Heath, Sarkar et al. (1998), Crane (1997), Ritter and Wilson (2000), Bowyer, 
Kranenburg, and Dougherty (2001), Arbeláez, Maire et al. (2011), and Pont-Tuset, Arbeláez 
et al. (2017). The topic of scale selection in edge detection is nicely treated by Elder and 
Zucker (1998), while approaches to color and texture edge detection can be found in Ruzon 
and Tomasi (2001), Martin, Fowlkes, and Malik (2004), and Gevers, van de Weijer, and Stok- 
man (2006). Edge detectors have also been combined with region segmentation techniques 
to further improve the detection of semantically salient boundaries (Maire, Arbeláez et al.
2008; Arbeláez, Maire et al. 2011; Xiaofeng and Bo 2012; Pont-Tuset, Arbeláez et al. 2017). 
Edges linked into contours can be smoothed and manipulated for artistic effect (Lowe 1989; 
Finkelstein and Salesin 1994; Taubin 1995) and used for recognition (Belongie, Malik, and 
Puzicha 2002; Tek and Kimia 2003; Sebastian and Kimia 2005). 


The topic of active contours has a long history, beginning with the seminal work on 
snakes and other energy-minimizing variational methods (Kass, Witkin, and Terzopoulos 
1988; Cootes, Cooper et al. 1995; Blake and Isard 1998), continuing through techniques 
such as intelligent scissors (Mortensen and Barrett 1995, 1999; Pérez, Blake, and Gangnet 
2001), and culminating in level sets (Malladi, Sethian, and Vemuri 1995; Caselles, Kimmel, 
and Sapiro 1997; Sethian 1999; Paragios and Deriche 2000; Sapiro 2001; Osher and Paragios 
2003; Paragios, Faugeras et al. 2005; Cremers, Rousson, and Deriche 2007; Rousson and 
Paragios 2008; Paragios and Sgallari 2009), which are currently the most widely used active 
contour methods. 

An early, well-regarded paper on straight line extraction in images was written by Burns, 
Hanson, and Riseman (1986). Their idea of bottom-up line-support regions was extended
by Grompone von Gioi, Jakubowicz et al. (2008) to construct the popular LSD line segment
detector. The literature on vanishing point detection is quite vast and still evolving (Quan
and Mohr 1989; Collins and Weiss 1990; Brillaut-O’Mahoney 1991; McLean and Kotturi 
1995; Becker and Bove 1995; Shufelt 1999; Tuytelaars, Van Gool, and Proesmans 1997; 
Schaffalitzky and Zisserman 2000; Antone and Teller 2002; Rother 2002; Košecká and Zhang
2005; Denis, Elder, and Estrada 2008; Pflugfelder 2008; Tardif 2009; Bazin, Seo et al. 2012; 
Antunes and Barreto 2013; Zhou, Qi et al. 2019a). Simultaneous line and junction detection 
techniques have also been developed (Huang, Wang et al. 2018; Zhang, Li et al. 2019). 

The topic of image segmentation is closely related to clustering techniques, which are 
treated in a number of monographs and review articles (Jain and Dubes 1988; Kaufman and 
Rousseeuw 1990; Jain, Duin, and Mao 2000; Jain, Topchy et al. 2004). Some early segmenta- 
tion techniques include those described by Brice and Fennema (1970), Pavlidis (1977), Rise- 
man and Arbib (1977), Ohlander, Price, and Reddy (1978), Rosenfeld and Davis (1979), and 
Haralick and Shapiro (1985), while examples of newer techniques are developed by Leclerc 
(1989), Mumford and Shah (1989), Shi and Malik (2000), and Felzenszwalb and Hutten- 
locher (2004). 

Arbeláez, Maire et al. (2011) and Pont-Tuset, Arbeláez et al. (2017) provide good reviews 
of automatic segmentation techniques and compare their performance on the Berkeley Segmentation
Dataset and Benchmark (Martin, Fowlkes et al. 2001).¹⁷ Additional comparison
papers and databases include those by Unnikrishnan, Pantofaru, and Hebert (2007), Alpert, 
Galun et al. (2007), and Estrada and Jepson (2009). 

Techniques for segmenting images based on local pixel similarities combined with ag- 
gregation or splitting methods include watersheds (Vincent and Soille 1991; Beare 2006; Ar- 
beláez, Maire et al. 2011), region splitting (Ohlander, Price, and Reddy 1978), region merg- 
ing (Brice and Fennema 1970; Pavlidis and Liow 1990; Jain, Topchy et al. 2004), as well as 
graph-based and probabilistic multi-scale approaches (Felzenszwalb and Huttenlocher 2004; 
Alpert, Galun et al. 2007). 

Mean-shift algorithms, which find modes (peaks) in a density function representation of 
the pixels, are presented by Comaniciu and Meer (2002) and Paris and Durand (2007). Para- 
metric mixtures of Gaussians can also be used to represent and segment such pixel densities 
(Bishop 2006; Ma, Derksen et al. 2007). 

The seminal work on spectral (eigenvalue) methods for image segmentation is the nor- 
malized cut algorithm of Shi and Malik (2000). Related work includes that by Weiss (1999), 
Meilă and Shi (2000), Meilă and Shi (2001), Malik, Belongie et al. (2001), Ng, Jordan, and
Weiss (2001), Yu and Shi (2003), Cour, Bénézit, and Shi (2005), Sharon, Galun et al. (2006), 
Tolliver and Miller (2006), and Wang and Oliensis (2010). 


¹⁷ http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench.


7.7 Exercises


Ex 7.1: Interest point detector. Implement one or more keypoint detectors and compare 
their performance (with your own or with a classmate’s detector). 
Possible detectors:

• Laplacian or Difference of Gaussian;

• Förstner–Harris Hessian (try different formula variants given in (7.9–7.11));

• oriented/steerable filter, looking for either second-order high second response or two
edges in a window (Koethe 2003), as discussed in Section 7.1.1;

• any of the newer DNN-based detectors.


Other detectors are described in Mikolajczyk, Tuytelaars et al. (2005), Tuytelaars and Miko- 
lajczyk (2008), and Balntas, Lenc et al. (2020). Additional optional steps could include: 


1. Compute the detections on a sub-octave pyramid and find 3D maxima. 


2. Find local orientation estimates using steerable filter responses or a gradient histogram- 
ming method. 


3. Implement non-maximal suppression, such as the adaptive technique of Brown, Szeliski, 
and Winder (2005). 


4. Vary the window shape and size (prefilter and aggregation). 


To test for repeatability, download the code from https://www.robots.ox.ac.uk/~vgg/research/affine
(Mikolajczyk, Tuytelaars et al. 2005; Tuytelaars and Mikolajczyk 2008) or simply
rotate or shear your own test images. (Pick a domain you may want to use later, e.g., for 
outdoor stitching.) 


Be sure to measure and report the stability of your scale and orientation estimates. 


Ex 7.2: Interest point descriptor. Implement two or more descriptors from Section 7.1.2 
(steered to local scale and orientation estimates, if appropriate) and compare their perfor- 
mance on some images of your own choosing. 

You can use any of the evaluation methodologies (and optionally software) described in
Mikolajczyk and Schmid (2005), Balntas, Lenc et al. (2020), or Jin, Mishkin et al. (2021).


Ex 7.3: ROC curve computation. Given a pair of curves (histograms) plotting the number
of matching and non-matching features as a function of Euclidean distance d as shown in
Figure 7.22b, derive an algorithm for plotting a ROC curve (Figure 7.22a). In particular, let
t(d) be the distribution of true matches and f(d) be the distribution of (false) non-matches.
Write down the equations for the ROC, i.e., TPR(FPR), and the AUC.

(Hint: Plot the cumulative distributions $T(d) = \int t(d)$ and $F(d) = \int f(d)$ and see if
these help you derive the TPR and FPR at a given threshold $\theta$.)
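Once you have derived the equations, a small numerical check such as the following may be useful. It is only a sketch under stated assumptions: the histograms are given as bin counts ordered by increasing distance, a match is accepted when its distance falls at or below the threshold, the AUC is approximated with the trapezoid rule, and the function name and counts are hypothetical.

```python
def roc_from_histograms(true_hist, false_hist):
    """Return (fpr, tpr, auc) from per-bin counts ordered by distance."""
    total_true, total_false = sum(true_hist), sum(false_hist)
    fpr, tpr = [0.0], [0.0]
    cum_true = cum_false = 0
    for t, f in zip(true_hist, false_hist):
        # Moving the threshold past one more bin accepts its matches.
        cum_true += t
        cum_false += f
        tpr.append(cum_true / total_true)
        fpr.append(cum_false / total_false)
    # Trapezoid rule over the (FPR, TPR) curve.
    auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
              for i in range(len(fpr) - 1))
    return fpr, tpr, auc


# Hypothetical counts: true matches cluster at small distances.
true_hist = [6, 3, 1, 0]
false_hist = [0, 1, 3, 6]
fpr, tpr, auc = roc_from_histograms(true_hist, false_hist)
```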


Ex 7.4: Feature matcher. After extracting features from a collection of overlapping or distorted
images,¹⁸ match them up by their descriptors either using nearest neighbor matching
or a more efficient matching strategy such as a k-d tree.

See whether you can improve the accuracy of your matches using techniques such as the
nearest neighbor distance ratio.
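As a starting point, brute-force matching with the nearest neighbor distance ratio might look like the sketch below. The function name, the default ratio of 0.8, and the toy two-dimensional descriptors are assumptions made for illustration; real descriptors have many more dimensions, and the exhaustive search would be replaced by a k-d tree or similar structure.

```python
import math


def match_with_ratio_test(descriptors_a, descriptors_b, max_ratio=0.8):
    """Return (i, j, ratio) for matches that pass the distance ratio test."""
    matches = []
    for i, a in enumerate(descriptors_a):
        ranked = sorted((math.dist(a, b), j)
                        for j, b in enumerate(descriptors_b))
        if len(ranked) < 2:
            continue                      # the ratio needs two candidates
        (best, j), (second, _) = ranked[0], ranked[1]
        # Keep the match only if the best candidate is clearly closer
        # than the runner-up.
        if second > 0 and best / second < max_ratio:
            matches.append((i, j, best / second))
    return matches


# Hypothetical descriptors: the third one in A is ambiguous.
descriptors_a = [(0.0, 0.0), (5.0, 5.0), (2.5, 2.4)]
descriptors_b = [(0.1, 0.0), (5.0, 5.2), (10.0, 10.0)]
matches = match_with_ratio_test(descriptors_a, descriptors_b)
```

The third descriptor lies almost midway between two candidates, so its ratio is close to 1 and it is rejected, while the two unambiguous matches are kept.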


Ex 7.5: Feature tracker. Instead of finding feature points independently in multiple im- 
ages and then matching them, find features in the first image of a video or image sequence 
and then re-locate the corresponding points in the next frames using either search and gradi- 
ent descent (Shi and Tomasi 1994) or learned feature detectors (Lepetit, Pilet, and Fua 2006; 
Fossati, Dimitrijevic et al. 2007). When the number of tracked points drops below a threshold 
or new regions in the image become visible, find additional points to track. 

(Optional) Winnow out incorrect matches by estimating a homography (8.19-8.23) or 
fundamental matrix (Section 11.3.3). 

(Optional) Refine the accuracy of your matches using the iterative registration algorithm
described in Section 9.2 and Exercise 9.2.


Ex 7.6: Facial feature tracker. Apply your feature tracker to tracking points on a person’s 
face, either manually initialized to interesting locations such as eye corners or automatically 
initialized at interest points. 

(Optional) Match features between two people and use these features to perform image 
morphing (Exercise 3.25). 


Ex 7.7: Edge detector. Implement an edge detector of your choice. Compare its perfor- 
mance to that of your classmates’ detectors or code downloaded from the internet. 


A simple but well-performing sub-pixel edge detector can be created as follows:

1. Blur the input image a little,
$$B_\sigma(x) = G_\sigma(x) * I(x).$$

2. Construct a Gaussian pyramid (Exercise 3.17),
$$P = \mathrm{Pyramid}\{B_\sigma(x)\}.$$

3. Subtract an interpolated coarser-level pyramid image from the original resolution blurred
image,
$$S(x) = B_\sigma(x) - P.\mathrm{InterpolatedLevel}(L).$$

4. For each quad of pixels, $\{(i, j), (i+1, j), (i, j+1), (i+1, j+1)\}$, count the number
of zero crossings along the four edges.

5. When there are exactly two zero crossings, compute their locations using (7.25) and
store these edgel endpoints along with the midpoint in the edgel structure.

6. For each edgel, compute the local gradient by taking the horizontal and vertical differences
between the values of S along the zero crossing edges.

7. Store the magnitude of this gradient as the edge strength and either its orientation or
that of the segment joining the edgel endpoints as the edge orientation.

8. Add the edgel to a list of edgels or store it in a 2D array of edgels (addressed by pixel
coordinates).


¹⁸ https://www.robots.ox.ac.uk/~vgg/research/affine.


Ex 7.8: Edge linking and thresholding. Link up the edges computed in the previous exer- 
cise into chains and optionally perform thresholding with hysteresis. 


The steps may include: 


1. Store the edgels either in a 2D array (say, an integer image with indices into the edgel 
list) or pre-sort the edgel list first by (integer) x coordinates and then y coordinates, for 


faster neighbor finding. 


2. Pick up an edgel from the list of unlinked edgels and find its neighbors in both direc- 
tions until no neighbor is found or a closed contour is obtained. Flag edgels as linked 


as you visit them and push them onto your list of linked edgels. 


3. (Optional) Perform hysteresis-based thresholding (Canny 1986). Use two thresholds 
“hi” and “lo” for the edge strength. A candidate edgel is considered an edge if either 
its strength is above the “hi” threshold or its strength is above the “lo” threshold and it 
is (recursively) connected to a previously detected edge. 


4. (Optional) Link together contours that have small gaps but whose endpoints have sim- 


ilar orientations. 
5. (Optional) Find junctions between adjacent contours, e.g., using some of the ideas (or
references) from Maire, Arbelaez et al. (2008). 
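
The hysteresis rule in step 3 can be sketched as a breadth-first traversal that starts from the strong edgels. The data layout is an assumption made for illustration: `strength[k]` is the strength of edgel `k` and `neighbors[k]` lists the edgels linked to it in step 2.

```python
from collections import deque

def hysteresis(strength, neighbors, lo, hi):
    """Return one boolean per edgel: True if it is accepted as an edge."""
    keep = [s > hi for s in strength]                 # strong edgels are always kept
    queue = deque(k for k, kept in enumerate(keep) if kept)
    while queue:
        k = queue.popleft()
        for n in neighbors[k]:
            # A weak edgel is kept only if it is connected to an accepted edgel.
            if not keep[n] and strength[n] > lo:
                keep[n] = True
                queue.append(n)
    return keep
```

An explicit queue replaces the recursion mentioned in the exercise, which avoids deep call stacks on long contours.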


Ex 7.9: Contour matching. Convert a closed contour (linked edgel list) into its arc-length 
parameterization and use this to match object outlines. 


The steps may include: 


1. Walk along the contour and create a list of $(x_i, y_i, s_i)$ triplets, using the arc-length
formula


$s_{i+1} = s_i + \|x_{i+1} - x_i\|.$ (7.42)


2. Resample this list onto a regular set of $(x_j, y_j, j)$ samples using linear interpolation of


each segment. 


3. Compute the average values of x and y, i.e., $\bar{x}$ and $\bar{y}$, and subtract them from your


sampled curve points. 


4. Resample the original $(x_i, y_i, s_i)$ piecewise-linear function onto a length-independent
set of samples, say $j \in [0, 1023]$. (Using a length which is a power of two makes


subsequent Fourier transforms more convenient.) 


5. Compute the Fourier transform of the curve, treating each (x,y) pair as a complex 


number. 


6. To compare two curves, fit a linear equation to the phase difference between the two 
curves. (Careful: phase wraps around at 360°. Also, you may wish to weight samples 


by their Fourier spectrum magnitude—see Section 9.1.2.) 


7. (Optional) Prove that the linear phase component corresponds to the shift in s (the choice
of starting point), while the constant component corresponds to rotation.


Of course, feel free to try any other curve descriptor and matching technique from the com- 
puter vision literature (Tek and Kimia 2003; Sebastian and Kimia 2005). 
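
Steps 1 through 5 can be sketched as below. This is one possible reading under stated assumptions: the function name `fourier_descriptor` is hypothetical, the contour is given as an (N, 2) array without a repeated closing point, and steps 2 and 4 are merged into a single resampling onto `n` equally spaced arc-length positions.

```python
import numpy as np

def fourier_descriptor(contour, n=1024):
    """Fourier transform of a closed contour resampled uniformly in arc length."""
    pts = np.asarray(contour, dtype=float)
    closed = np.vstack([pts, pts[:1]])                      # close the loop
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)   # ||x_{i+1} - x_i||
    s = np.concatenate([[0.0], np.cumsum(seg)])             # arc length s_i, as in (7.42)
    u = np.linspace(0.0, s[-1], n, endpoint=False)          # length-independent samples
    x = np.interp(u, s, closed[:, 0])                       # piecewise-linear resampling
    y = np.interp(u, s, closed[:, 1])
    z = (x - x.mean()) + 1j * (y - y.mean())                # centered complex samples
    return np.fft.fft(z)
```

Because the samples are centered, the zero-frequency coefficient vanishes, and rotating the input contour leaves the coefficient magnitudes unchanged.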


Ex 7.10: Jigsaw puzzle solver—challenging. Write a program to automatically solve a 
jigsaw puzzle from a set of scanned puzzle pieces. Your software may include the following 
components: 


1. Scan the pieces (either face up or face down) on a flatbed scanner with a distinctively 


colored background. 


2. (Optional) Scan in the box top to use as a low-resolution reference image. 
3. Use color-based thresholding to isolate the pieces.
4. Extract the contour of each piece using edge finding and linking. 


5. (Optional) Re-represent each contour using an arc-length or some other re-parameterization. 


Break up the contours into meaningful matchable pieces. (Is this hard?) 
6. (Optional) Associate color values with each contour to help in the matching. 


7. (Optional) Match pieces to the reference image using some rotationally invariant fea- 


ture descriptors. 


8. Solve a global optimization or (backtracking) search problem to snap pieces together 


and place them in the correct location relative to the reference image. 


9. Test your algorithm on a succession of more difficult puzzles and compare your results 
with those of others. 


For some additional ideas, have a look at Cho, Avidan, and Freeman (2010). 


Ex 7.11: Successive approximation line detector. Implement a line simplification algo- 
rithm (Section 7.4.1) (Ramer 1972; Douglas and Peucker 1973) to convert a hand-drawn 
curve (or linked edge image) into a small set of polylines. 

(Optional) Re-render this curve using either an approximating or interpolating spline or 
Bezier curve (Szeliski and Ito 1986; Bartels, Beatty, and Barsky 1987; Farin 2002). 
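
A minimal recursive sketch of the successive approximation idea (in the style of Ramer 1972 and Douglas and Peucker 1973) is given below. The function name `simplify` and the tolerance parameter `tol` are illustrative assumptions; distances are measured to the infinite line through the chord endpoints.

```python
import numpy as np

def simplify(points, tol):
    """Return a reduced list of [x, y] vertices approximating the input polyline."""
    pts = np.asarray(points, dtype=float)
    if len(pts) < 3:
        return pts.tolist()
    a, b = pts[0], pts[-1]
    chord = b - a
    length = np.linalg.norm(chord)
    rel = pts[1:-1] - a
    if length == 0.0:                       # closed curve: fall back to distance from a
        dist = np.linalg.norm(rel, axis=1)
    else:                                   # perpendicular distance to the chord line
        dist = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / length
    k = int(np.argmax(dist)) + 1            # index of the farthest interior point
    if dist[k - 1] <= tol:
        return [a.tolist(), b.tolist()]     # the chord is a good enough approximation
    left = simplify(pts[: k + 1], tol)      # otherwise split there and recurse
    right = simplify(pts[k:], tol)
    return left[:-1] + right                # drop the duplicated split point
```

For example, `simplify([(0, 0), (1, 0.05), (2, 0), (3, 3), (4, 0)], 0.1)` drops only the nearly collinear second point.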


Ex 7.12: Line fitting uncertainty. Estimate the uncertainty (covariance) in your line fit us- 
ing uncertainty analysis. 


1. After determining which edgels belong to the line segment (using either successive 
approximation or Hough transform), re-fit the line segment using total least squares 
(Van Huffel and Vandewalle 1991; Van Huffel and Lemmerling 2002), i.e., find the 
mean or centroid of the edgels and then use eigenvalue analysis to find the dominant 
orientation. 


2. Compute the perpendicular errors (deviations) to the line and robustly estimate the 


variance of the fitting noise using an estimator such as MAD (Appendix B.3). 


3. (Optional) re-fit the line parameters by throwing away outliers or using a robust norm 


or influence function. 


4. Estimate the error in the perpendicular location of the line segment and its orientation. 
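
Steps 1, 2, and 4 can be sketched as follows. Several choices here are assumptions rather than part of the exercise: the function name `fit_line_tls`, the factor 1.4826 (the usual constant that makes the MAD consistent with a Gaussian standard deviation), and the first-order variance formulas, which treat the noise as independent and isotropic.

```python
import numpy as np

def fit_line_tls(points):
    """Total least squares line fit with a robust noise estimate.

    Returns (centroid, direction, sigma, var_offset, var_theta).
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    d = pts - centroid
    evals, evecs = np.linalg.eigh(d.T @ d)    # eigenvalues in ascending order
    direction = evecs[:, 1]                   # dominant orientation of the edgels
    normal = evecs[:, 0]
    errors = d @ normal                       # signed perpendicular deviations
    mad = np.median(np.abs(errors - np.median(errors)))
    sigma = 1.4826 * mad                      # robust standard deviation estimate
    along = d @ direction                     # positions of the edgels along the line
    var_offset = sigma**2 / len(pts)          # variance of the perpendicular location
    var_theta = sigma**2 / np.sum(along**2)   # variance of the orientation (radians^2)
    return centroid, direction, sigma, var_offset, var_theta
```

Longer segments with more edgels yield smaller orientation variance, since the sum of squared positions along the line grows quickly with segment length.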
Ex 7.13: Vanishing points. Compute the vanishing points in an image using one of the
techniques described in Section 7.4.3 and optionally refine the original line equations associ- 
ated with each vanishing point. Your results can be used later to track a target or reconstruct 
architecture (Section 13.6.1). 


Ex 7.14: Vanishing point uncertainty. Perform an uncertainty analysis on your estimated 
vanishing points. You will need to decide how to represent your vanishing point, e.g., homo- 
geneous coordinates on a sphere, to handle vanishing points near infinity. 

See the discussion of Bingham distributions by Collins and Weiss (1990) for some ideas. 


Ex 7.15: Region segmentation. Implement one of the region segmentation algorithms de- 
scribed in this chapter. Some popular segmentation algorithms include: 


• k-means (Section 5.2.2);
• mixtures of Gaussians (Section 5.2.2);
• mean shift (Section 7.5.2);
• normalized cuts (Section 7.5.3);
• similarity graph-based segmentation (Section 7.5.1);
• binary Markov random fields solved using graph cuts (Section 4.3.2).


Apply your region segmentation to a video sequence and use it to track moving regions 
from frame to frame. 

Alternatively, test out your segmentation algorithm on the Berkeley segmentation database 
(Martin, Fowlkes et al. 2001). 
Figure 8.1 Image stitching: (a) geometric alignment of 2D images for stitching (Szeliski
and Shum 1997) © 1997 ACM; (b) a spherical panorama constructed from 54 photographs 
(Szeliski and Shum 1997) © 1997 ACM; (c) a multi-image panorama automatically assembled 
from an unordered photo collection; a multi-image stitch (d) without and (e) with moving 
object removal (Uyttendaele, Eden, and Szeliski 2001) © 2001 IEEE. 
Figure 8.2 Basic set of 2D planar transformations. [Figure labels: translation, Euclidean,
similarity, projective.]


Once we have extracted features from images, the next stage in many vision algorithms 
is to match these features across different images (Section 7.1.3). An important component 
of this matching is to verify whether the set of matching features is geometrically consistent, 
e.g., whether the feature displacements can be described by a simple 2D or 3D geometric 
transformation. The computed motions can then be used in other applications such as image 
stitching (Section 8.2) or augmented reality (Section 11.2.2). 


In this chapter, we look at the topic of geometric image registration, i.e., the computation 
of 2D and 3D transformations that map features in one image to another (Section 8.1). In 
Chapter 11, we look at the related problems of pose estimation, which is determining a cam- 
era’s position relative to a known 3D object or scene, and structure from motion, i.e., how to 


simultaneously estimate 3D geometry and camera motion. 


8.1 Pairwise alignment 


Feature-based alignment is the problem of estimating the motion between two or more sets 
of matched 2D or 3D points. In this section, we restrict ourselves to global parametric trans- 
formations, such as those described in Section 2.1.1 and shown in Table 2.1 and Figure 8.2, 
or higher order transformation for curved surfaces (Shashua and Toelg 1997; Can, Stewart et 
al. 2002). Applications to non-rigid or elastic deformations (Bookstein 1989; Kambhamettu, 
Goldgof et al. 1994; Szeliski and Lavallée 1996; Torresani, Hertzmann, and Bregler 2008) 
are examined in Sections 9.2.2 and 13.6.4. 
Transform     Matrix                                                  Parameters p                          Jacobian J

translation   [1, 0, t_x; 0, 1, t_y]                                  (t_x, t_y)                            [1, 0; 0, 1]
Euclidean     [c_θ, −s_θ, t_x; s_θ, c_θ, t_y]                         (t_x, t_y, θ)                         [1, 0, −s_θ x − c_θ y; 0, 1, c_θ x − s_θ y]
similarity    [1+a, −b, t_x; b, 1+a, t_y]                             (t_x, t_y, a, b)                      [1, 0, x, −y; 0, 1, y, x]
affine        [1+a_00, a_01, t_x; a_10, 1+a_11, t_y]                  (t_x, t_y, a_00, a_01, a_10, a_11)    [1, 0, x, y, 0, 0; 0, 1, 0, 0, x, y]
projective    [1+h_00, h_01, h_02; h_10, 1+h_11, h_12; h_20, h_21, 1] (h_00, h_01, ..., h_21)               (see Section 8.1.3)


Table 8.1  Jacobians of the 2D coordinate transformations x’ = f(x; p) shown in Table 2.1, 


where we have re-parameterized the motions so that they are identity for p = 0. 


8.1.1 2D alignment using least squares 


Given a set of matched feature points $\{(x_i, x'_i)\}$ and a planar parametric transformation¹ of
the form


$x' = f(x; p),$ (8.1)


how can we produce the best estimate of the motion parameters p? The usual way to do this
is to use least squares, i.e., to minimize the sum of squared residuals

$E_{\mathrm{LS}} = \sum_i \|r_i\|^2 = \sum_i \|f(x_i; p) - x'_i\|^2,$ (8.2)


where


$r_i = x'_i - f(x_i; p) = \hat{x}'_i - \tilde{x}'_i$ (8.3)


is the residual between the measured location $\hat{x}'_i$ and its corresponding current predicted
location $\tilde{x}'_i = f(x_i; p)$. (See Appendix A.2 for more on least squares and Appendix B.2 for a
statistical justification.)


¹ For examples of non-planar parametric models, such as quadrics, see the work of Shashua and Toelg (1997) and
Shashua and Wexler (2001). 
Many of the motion models presented in Section 2.1.1 and Table 2.1, i.e., translation,
similarity, and affine, have a linear relationship between the amount of motion $\Delta x = x' - x$
and the unknown parameters p,

$\Delta x = x' - x = J(x) p,$ (8.4)


where $J = \partial f / \partial p$ is the Jacobian of the transformation f with respect to the motion
parameters p (see Table 8.1). In this case, a simple linear regression (linear least squares problem)
can be formulated as


$E_{\mathrm{LLS}} = \sum_i \|J(x_i) p - \Delta x_i\|^2$ (8.5)

$= p^T \left[ \sum_i J^T(x_i) J(x_i) \right] p - 2 p^T \left[ \sum_i J^T(x_i) \Delta x_i \right] + \sum_i \|\Delta x_i\|^2$ (8.6)

$= p^T A p - 2 p^T b + c.$ (8.7)


The minimum can be found by solving the symmetric positive definite (SPD) system of normal
equations²


$A p = b,$ (8.8)


where


$A = \sum_i J^T(x_i) J(x_i)$ (8.9)


is called the Hessian and $b = \sum_i J^T(x_i) \Delta x_i$. For the case of pure translation, the resulting
equations have a particularly simple form, i.e., the translation is the average translation


between corresponding points or, equivalently, the translation of the point centroids. 
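
As an illustration of (8.5) to (8.9), the sketch below accumulates A and b for the affine model using the Jacobian from Table 8.1 and solves the normal equations (8.8). The function name `fit_affine` and the (N, 2) array layout are assumptions made for this example.

```python
import numpy as np

def fit_affine(x, x_prime):
    """Estimate p = (t_x, t_y, a00, a01, a10, a11) from matched points of shape (N, 2)."""
    x = np.asarray(x, dtype=float)
    dx = np.asarray(x_prime, dtype=float) - x          # motion Delta x_i = x'_i - x_i
    A = np.zeros((6, 6))
    b = np.zeros(6)
    for (px, py), d in zip(x, dx):
        J = np.array([[1, 0, px, py, 0, 0],            # affine Jacobian from Table 8.1
                      [0, 1, 0, 0, px, py]], dtype=float)
        A += J.T @ J                                   # Hessian, Equation (8.9)
        b += J.T @ d                                   # right-hand side
    return np.linalg.solve(A, b)                       # normal equations, Equation (8.8)
```

Replacing `J` with the 2 × 2 identity reduces the solution to the mean of the displacements, which is the pure translation case described above.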


Uncertainty weighting. The above least squares formulation assumes that all feature points 
are matched with the same accuracy. This is often not the case, since certain points may fall 
into more textured regions than others. If we associate a scalar variance estimate $\sigma_i^2$ with
each correspondence, we can minimize the weighted least squares problem instead,³


$E_{\mathrm{WLS}} = \sum_i \sigma_i^{-2} \|r_i\|^2.$ (8.10)


² For poorly conditioned problems, it is better to use QR decomposition on the set of linear equations $J(x_i) p = \Delta x_i$
instead of the normal equations (Björck 1996; Golub and Van Loan 1996). However, such conditions rarely
arise in image registration.

³ Problems where each measurement can have a different variance or uncertainty are called heteroscedastic models.
Figure 8.3 A simple panograph consisting of three images automatically aligned with a
translational model and then averaged together. 


As shown in Section 9.1.3, a covariance estimate for patch-based matching can be obtained
by multiplying the inverse of the patch Hessian $A_i$ (9.48) with the per-pixel noise covariance
$\sigma_n^2$ (9.37). Weighting each squared residual by its inverse covariance $\Sigma_i^{-1} = \sigma_n^{-2} A_i$ (which
is called the information matrix), we obtain


$E_{\mathrm{CWLS}} = \sum_i \|r_i\|^2_{\Sigma_i^{-1}} = \sum_i r_i^T \Sigma_i^{-1} r_i = \sum_i \sigma_n^{-2} r_i^T A_i r_i.$ (8.11)


8.1.2 Application: Panography 


One of the simplest (and most fun) applications of image alignment is a special form of image 
stitching called panography. In a panograph, images are translated and optionally rotated and 
scaled before being blended with simple averaging (Figure 8.3). This process mimics the 
photographic collages created by artist David Hockney, although his compositions use an 
opaque overlay model, being created out of regular photographs. 

In most of the examples seen on the web, the images are aligned by hand for best artistic 
effect.⁴ However, it is also possible to use feature matching and alignment techniques to
perform the registration automatically (Nomura, Zhang, and Nayar 2007; Zelnik-Manor and 
Perona 2007). 

Consider a simple translational model. We want all the corresponding features in different
images to line up as well as possible. Let $t_j$ be the location of the jth image coordinate frame
in the global composite frame and $x_{ij}$ be the location of the ith matched feature in the jth
image. In order to align the images, we wish to minimize the least squares error


$E_{\mathrm{PLS}} = \sum_{ij} \|(t_j + x_{ij}) - x_i\|^2,$ (8.12)


where $x_i$ is the consensus (average) position of feature i in the global coordinate frame.
(An alternative approach is to register each pair of overlapping images separately and then
compute a consensus location for each frame; see Exercise 8.2.)

The above least squares problem is indeterminate (you can add a constant offset to all the
frame and point locations $t_j$ and $x_i$). To fix this, either pick one frame as being at the origin
or add a constraint to make the average frame offsets be 0.


⁴ https://www.flickr.com/groups/panography.

The formulas for adding rotation and scale transformations are straightforward and are 
left as an exercise (Exercise 8.2). See if you can create some collages that you would be 


happy to share with others on the web. 
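
For the translational case, the joint estimate of frame offsets and consensus feature positions in (8.12) can be sketched as a single linear least squares solve. The observation format `(i, j, (x, y))` and the function name `align_translations` are hypothetical; the indeterminacy is removed by pinning image 0 at the origin.

```python
import numpy as np

def align_translations(obs, n_images, n_features):
    """obs: list of (i, j, (x, y)), meaning feature i was seen at (x, y) in image j.

    Returns (t, X): frame offsets t[j] with t[0] = 0, and consensus positions X[i].
    """
    n_unknowns = (n_images - 1) + n_features
    M = np.zeros((len(obs), n_unknowns))
    rhs = np.zeros((len(obs), 2))
    for row, (i, j, xij) in enumerate(obs):
        if j > 0:
            M[row, j - 1] = 1.0                  # coefficient of t_j (t_0 is fixed at 0)
        M[row, n_images - 1 + i] = -1.0          # coefficient of the consensus position x_i
        rhs[row] = -np.asarray(xij, dtype=float) # from (t_j + x_ij) - x_i = 0
    sol, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    t = np.vstack([np.zeros((1, 2)), sol[: n_images - 1]])
    return t, sol[n_images - 1:]
```

The x and y coordinates share the same design matrix, so both are solved in one call by passing a two-column right-hand side.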


8.1.3 Iterative algorithms 


While linear least squares is the simplest method for estimating parameters, most problems in 
computer vision do not have a simple linear relationship between the measurements and the 
unknowns. In this case, the resulting problem is called non-linear least squares or non-linear 
regression. 

Consider, for example, the problem of estimating a rigid Euclidean 2D transformation
(translation plus rotation) between two sets of points. If we parameterize this transformation
by the translation amount $(t_x, t_y)$ and the rotation angle $\theta$, as in Table 2.1, the Jacobian of
this transformation, given in Table 8.1, depends on the current value of $\theta$. Notice how in
Table 8.1, we have re-parameterized the motion matrices so that they are always the identity
at the origin p = 0, which makes it easier to initialize the motion parameters.

To minimize the non-linear least squares problem, we iteratively find an update $\Delta p$ to the
current parameter estimate p by minimizing

$E_{\mathrm{NLS}}(\Delta p) = \sum_i \|f(x_i; p + \Delta p) - x'_i\|^2$ (8.13)

$\approx \sum_i \|J(x_i; p) \Delta p - r_i\|^2$ (8.14)

$= \Delta p^T \left[ \sum_i J^T J \right] \Delta p - 2 \Delta p^T \left[ \sum_i J^T r_i \right] + \sum_i \|r_i\|^2$ (8.15)

$= \Delta p^T A \Delta p - 2 \Delta p^T b + c,$ (8.16)

where the “Hessian”⁵ A is the same as Equation (8.9) and the right-hand side vector

$b = \sum_i J^T(x_i) r_i$ (8.17)


is now a Jacobian-weighted sum of residual vectors. This makes intuitive sense, as the pa- 
rameters are pulled in the direction of the prediction error with a strength proportional to the 
Jacobian. 


Once A and b have been computed, we solve for $\Delta p$ using

$(A + \lambda\,\mathrm{diag}(A)) \Delta p = b,$ (8.18)


and update the parameter vector $p \leftarrow p + \Delta p$ accordingly. The parameter $\lambda$ is an additional
damping parameter used to ensure that the system takes a “downhill” step in energy
(squared error) and is an essential component of the Levenberg–Marquardt algorithm (described
in more detail in Appendix A.3). In many applications, it can be set to 0 if the system
is successfully converging.

For the case of our 2D translation+rotation, we end up with a 3 × 3 set of normal equations
in the unknowns $(\delta t_x, \delta t_y, \delta\theta)$. An initial guess for $(t_x, t_y, \theta)$ can be obtained by fitting a
four-parameter similarity transform in $(t_x, t_y, c, s)$ and then setting $\theta = \tan^{-1}(s/c)$. An
alternative approach is to estimate the translation parameters using the centroids of the 2D 
points and to then estimate the rotation angle using polar coordinates (Exercise 8.3). 
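
The iteration in (8.13) to (8.18) for the 2D Euclidean model can be sketched as below. This is a simplified illustration: the function name `fit_euclidean` is hypothetical, the damping parameter is held fixed (a full Levenberg–Marquardt implementation would adapt it and reject uphill steps), and the iteration starts from the identity rather than from one of the initial guesses described above.

```python
import numpy as np

def fit_euclidean(x, x_prime, n_iters=20, lam=0.0):
    """Estimate (t_x, t_y, theta) from matched (N, 2) point arrays by damped Gauss-Newton."""
    x = np.asarray(x, dtype=float)
    xp = np.asarray(x_prime, dtype=float)
    tx, ty, theta = 0.0, 0.0, 0.0                       # identity transform at p = 0
    for _ in range(n_iters):
        c, s = np.cos(theta), np.sin(theta)
        pred = np.column_stack([c * x[:, 0] - s * x[:, 1] + tx,
                                s * x[:, 0] + c * x[:, 1] + ty])
        r = xp - pred                                   # residuals r_i
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for (px, py), ri in zip(x, r):
            J = np.array([[1.0, 0.0, -s * px - c * py], # Euclidean Jacobian, Table 8.1
                          [0.0, 1.0,  c * px - s * py]])
            A += J.T @ J
            b += J.T @ ri                               # Equation (8.17)
        dp = np.linalg.solve(A + lam * np.diag(np.diag(A)), b)  # Equation (8.18)
        tx, ty, theta = tx + dp[0], ty + dp[1], theta + dp[2]
    return tx, ty, theta
```

With noise-free matches and a moderate rotation, the estimate settles within a few iterations; the fixed iteration count is only a convenience for the sketch.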

For the other 2D motion models, the derivatives in Table 8.1 are all fairly straightforward,
except for the projective 2D motion (homography), which arises in image-stitching applications
(Section 8.2). These equations can be re-written from (2.21) in their new parametric
form as

$x' = \frac{(1 + h_{00}) x + h_{01} y + h_{02}}{h_{20} x + h_{21} y + 1}$ and $y' = \frac{h_{10} x + (1 + h_{11}) y + h_{12}}{h_{20} x + h_{21} y + 1}.$ (8.19)

The Jacobian is therefore

$J = \frac{\partial f}{\partial p} = \frac{1}{D} \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x'x & -x'y \\ 0 & 0 & 0 & x & y & 1 & -y'x & -y'y \end{bmatrix},$ (8.20)


where $D = h_{20} x + h_{21} y + 1$ is the denominator in (8.19), which depends on the current
parameter settings (as do $x'$ and $y'$).

An initial guess for the eight unknowns $\{h_{00}, h_{01}, \ldots, h_{21}\}$ can be obtained by multiplying
both sides of the equations in (8.19) through by the denominator, which yields the linear

⁵ The “Hessian” A is not the true Hessian (second derivative) of the non-linear least squares problem (8.13).
Instead, it is the approximate Hessian, which neglects second (and higher) order derivatives of $f(x_i; p + \Delta p)$.
set of equations,

$\begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -\hat{x}'x & -\hat{x}'y \\ 0 & 0 & 0 & x & y & 1 & -\hat{y}'x & -\hat{y}'y \end{bmatrix} \begin{bmatrix} h_{00} \\ \vdots \\ h_{21} \end{bmatrix}.$ (8.21)

However, this is not optimal from a statistical point of view, since the denominator D, which
was used to multiply each equation, can vary quite a bit from point to point.⁶
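One way to compensate for this is to reweight each equation by the inverse of the current
estimate of the denominator, D,

$\frac{1}{D} \begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \frac{1}{D} \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -\hat{x}'x & -\hat{x}'y \\ 0 & 0 & 0 & x & y & 1 & -\hat{y}'x & -\hat{y}'y \end{bmatrix} \begin{bmatrix} h_{00} \\ \vdots \\ h_{21} \end{bmatrix}.$ (8.22)
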
While this may at first seem to be the exact same set of equations as (8.21), because least 
squares is being used to solve the over-determined set of equations, the weightings do matter 
and produce a different set of normal equations that performs better in practice. 
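
The unweighted linear initialization (8.21) can be sketched as follows; the reweighted system (8.22) would simply scale each pair of rows by the current 1/D before solving. The function name `homography_initial_guess` and the array layout are assumptions, and at least four matches in general position are required.

```python
import numpy as np

def homography_initial_guess(x, x_hat):
    """Solve (8.21) in the least squares sense; returns a 3x3 matrix with H[2, 2] = 1.

    x: (N, 2) source points; x_hat: (N, 2) measured locations in the other image.
    """
    x = np.asarray(x, dtype=float)
    xh = np.asarray(x_hat, dtype=float)
    rows, rhs = [], []
    for (px, py), (qx, qy) in zip(x, xh):
        rows.append([px, py, 1, 0, 0, 0, -qx * px, -qx * py])
        rhs.append(qx - px)                     # left-hand side entry x_hat' - x
        rows.append([0, 0, 0, px, py, 1, -qy * px, -qy * py])
        rhs.append(qy - py)                     # left-hand side entry y_hat' - y
    h, *_ = np.linalg.lstsq(np.array(rows, dtype=float),
                            np.array(rhs, dtype=float), rcond=None)
    h00, h01, h02, h10, h11, h12, h20, h21 = h
    return np.array([[1 + h00, h01, h02],       # parameters are offsets from the identity
                     [h10, 1 + h11, h12],
                     [h20, h21, 1.0]])
```

`np.linalg.lstsq` works on the stacked equations directly rather than forming the normal equations, which is the better-conditioned route mentioned in the earlier footnote on QR decomposition.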
The most principled way to do the estimation, however, is to directly minimize the squared
residual in Equation (8.13) using the Gauss–Newton approximation, i.e., performing a first-order
Taylor series expansion in p, as shown in (8.14), which yields the set of equations

$\begin{bmatrix} \hat{x}' - \tilde{x}' \\ \hat{y}' - \tilde{y}' \end{bmatrix} = \frac{1}{D} \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -\tilde{x}'x & -\tilde{x}'y \\ 0 & 0 & 0 & x & y & 1 & -\tilde{y}'x & -\tilde{y}'y \end{bmatrix} \begin{bmatrix} \Delta h_{00} \\ \vdots \\ \Delta h_{21} \end{bmatrix}.$ (8.23)

While these look similar to (8.22), they differ in two important respects. First, the left-hand
side consists of unweighted prediction errors rather than point displacements and the solution
vector is a perturbation to the parameter vector p. Second, the quantities inside J involve
predicted feature locations $(\tilde{x}', \tilde{y}')$ instead of sensed feature locations $(\hat{x}', \hat{y}')$. Both of these
differences are subtle and yet they lead to an algorithm that, when combined with proper
checking for downhill steps (as in the Levenberg–Marquardt algorithm), will converge to a
local minimum. Note that iterating Equations (8.22) is not guaranteed to converge, since it is
not minimizing a well-defined energy function.


⁶ Hartley and Zisserman (2004) call this strategy of forming linear equations from rational equations the direct
linear transform, but that term is more commonly associated with pose estimation (Section 11.2). Note also that our
definition of the $h_{ij}$ parameters differs from that used in their book, since we define $h_{ij}$ to be the difference from
unity and we do not leave $h_{22}$ as a free parameter, which means that we cannot handle certain extreme homographies.
Equation (8.23) is analogous to the additive algorithm for direct intensity-based registration
(Section 9.2), since the change to the full transformation is being computed. If we
prepend an incremental homography to the current homography instead, i.e., we use a compositional
algorithm (described in Section 9.2), we get D = 1 (since p = 0) and the above
formula simplifies to

$\begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -x^2 & -xy \\ 0 & 0 & 0 & x & y & 1 & -xy & -y^2 \end{bmatrix} \begin{bmatrix} \Delta h_{00} \\ \vdots \\ \Delta h_{21} \end{bmatrix},$ (8.24)

where we have replaced $(\tilde{x}', \tilde{y}')$ with $(x, y)$ for conciseness.


8.1.4 Robust least squares and RANSAC 


While regular least squares is the method of choice for measurements where the noise follows 
a normal (Gaussian) distribution, more robust versions of least squares are required when 
there are outliers among the correspondences (as there almost always are). In this case, it 
is preferable to use an M-estimator (Huber 1981; Hampel, Ronchetti et al. 1986; Black and 
Rangarajan 1996; Stewart 1999), which involves applying a robust penalty function $\rho(r)$ to
the residuals

$E_{\mathrm{RLS}}(\Delta p) = \sum_i \rho(\|r_i\|)$ (8.25)


instead of squaring them.⁷


We can take the derivative of this function with respect to p and set it to 0,

$\sum_i \psi(\|r_i\|) \frac{\partial \|r_i\|}{\partial p} = \sum_i \frac{\psi(\|r_i\|)}{\|r_i\|} r_i^T \frac{\partial r_i}{\partial p} = 0,$ (8.26)


where $\psi(r) = \rho'(r)$ is the derivative of $\rho$ and is called the influence function. If we introduce
a weight function, $w(r) = \psi(r)/r$, we observe that finding the stationary point of (8.25) using
(8.26) is equivalent to minimizing the iteratively reweighted least squares (IRLS) problem


$E_{\mathrm{IRLS}} = \sum_i w(\|r_i\|) \|r_i\|^2,$ (8.27)


where the $w(\|r_i\|)$ play the same local weighting role as $\sigma_i^{-2}$ in (8.10). The IRLS algorithm
alternates between computing the weights $w(\|r_i\|)$ and solving the resulting
weighted least squares problem (with fixed w values). Other incremental robust least squares


⁷ The plots for some commonly used robust penalty functions $\rho$ can be found in Figure 4.7.
algorithms can be found in the work of Sawhney and Ayer (1996), Black and Anandan (1996),
Black and Rangarajan (1996), and Baker, Gross et al. (2003) and in textbooks and tutorials 
on robust statistics (Huber 1981; Hampel, Ronchetti et al. 1986; Rousseeuw and Leroy 1987; 
Stewart 1999). 
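
A small sketch of the IRLS alternation in (8.27) for a pure translation is shown below. The Huber penalty, its threshold `c`, the median initialization, and the function name `irls_translation` are choices made for this illustration; any penalty from the robust statistics literature could be substituted through its weight function w(r) = ψ(r)/r.

```python
import numpy as np

def irls_translation(x, x_prime, c=1.0, n_iters=20):
    """Robustly estimate a 2D translation from matched (N, 2) point arrays."""
    d = np.asarray(x_prime, dtype=float) - np.asarray(x, dtype=float)
    t = np.median(d, axis=0)                                  # starting point (an added choice)
    for _ in range(n_iters):
        r = np.linalg.norm(d - t, axis=1)                     # residual norms ||r_i||
        # Huber weight: psi(r)/r = 1 for r <= c and c/r otherwise.
        w = np.where(r <= c, 1.0, c / np.maximum(r, 1e-12))
        t = (w[:, None] * d).sum(axis=0) / w.sum()            # weighted least squares solve
    return t
```

With the Huber penalty, gross outliers are down-weighted rather than rejected outright, so a small bias remains; redescending penalties suppress them more aggressively at the cost of more local minima.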

While M-estimators can definitely help reduce the influence of outliers, in some cases, 
starting with too many outliers will prevent IRLS (or other gradient descent algorithms) from 
converging to the global optimum. A better approach is often to find a starting set of inlier 
correspondences, i.e., points that are consistent with a dominant motion estimate.⁸

Two widely used approaches to this problem are called RANdom SAmple Consensus, or 
RANSAC for short (Fischler and Bolles 1981), and least median of squares (LMS) (Rousseeuw 
1984). Both techniques start by selecting (at random) a subset of k correspondences, which is 
then used to compute an initial estimate for p. The residuals of the full set of correspondences 
are then computed as 

$r_i = \tilde{x}'_i(x_i; p) - \hat{x}'_i,$ (8.28)
where $\tilde{x}'_i$ are the estimated (mapped) locations and $\hat{x}'_i$ are the sensed (detected) feature point
locations.⁹

The RANSAC technique then counts the number of inliers that are within $\epsilon$ of their predicted
location, i.e., whose $\|r_i\| < \epsilon$. (The $\epsilon$ value is application dependent but is often
around 1-3 pixels.) Least median of squares finds the median value of the $\|r_i\|^2$ values. The
random selection process is repeated S times and the sample set with the largest number of 
inliers (or with the smallest median residual) is kept as the final solution. Either the initial 
parameter guess p or the full set of computed inliers is then passed on to the next data fitting 
stage. 
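
The sampling and scoring loop can be sketched for the simplest motion model, a 2D translation, where a single correspondence (k = 1) determines the hypothesis. The function name `ransac_translation`, the fixed random seed, and the final refit on the inlier set are assumptions made for this example.

```python
import numpy as np

def ransac_translation(x, x_prime, eps=2.0, n_trials=50, seed=0):
    """Return (t, inliers): a translation estimate and a boolean inlier mask."""
    rng = np.random.default_rng(seed)
    d = np.asarray(x_prime, dtype=float) - np.asarray(x, dtype=float)
    best = np.zeros(len(d), dtype=bool)
    for _ in range(n_trials):
        t = d[rng.integers(len(d))]                      # hypothesis from one random match
        inliers = np.linalg.norm(d - t, axis=1) < eps    # residual test ||r_i|| < eps
        if inliers.sum() > best.sum():                   # keep the largest consensus set
            best = inliers
    return d[best].mean(axis=0), best                    # least squares refit on the inliers
```

For models with more parameters, only the hypothesis step changes: draw k matches and fit the model to them before scoring all correspondences.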

When the number of measurements is quite large, it may be preferable to only score a 
subset of the measurements in an initial round that selects the most plausible hypotheses for 
additional scoring and selection. This modification of RANSAC, which can significantly 
speed up its performance, is called Preemptive RANSAC (Nistér 2003). In another variant on 
RANSAC called PROSAC (PROgressive SAmple Consensus), random samples are initially 
added from the most “confident” matches, thereby speeding up the process of finding a (statistically)
likely good set of inliers (Chum and Matas 2005). Raguram, Chum et al. (2012)
provide a unified framework from which most of these techniques can be derived as well as a 
nice experimental comparison. 

Additional variants on RANSAC include MLESAC (Torr and Zisserman 2000), DSAC 


⁸ For pixel-based alignment methods (Section 9.1.1), hierarchical (coarse-to-fine) techniques are often used to
lock onto the dominant motion in a scene.
⁹ For problems such as epipolar geometry estimation, the residual may be the distance between a point and a line.
k    p      S

3    0.5    35
6    0.6    97
6    0.5    293


Table 8.2 Number of trials S to attain a 99% probability of success (Stewart 1999). 


(Brachmann, Krull et al. 2017), Graph-Cut RANSAC (Barath and Matas 2018), MAGSAC 
(Barath, Matas, and Noskova 2019), and ESAC (Brachmann and Rother 2019). Some of 
these algorithms, such as DSAC (Differentiable RANSAC), are designed to be differentiable 
so they can be used in end-to-end training of feature detection and matching pipelines (Sec- 
tion 7.1). The MAGSAC++ paper by Barath, Noskova et al. (2020) compares many of these 
variants. Yang, Antonante et al. (2020) claim that using a robust penalty function with a 
decreasing outlier parameter, i.e., graduated non-convexity (Blake and Zisserman 1987; Bar- 
ron 2019), can outperform RANSAC in many geometric correspondence and pose estimation 
problems.

To ensure that the random sampling has a good chance of finding a true set of inliers,
a sufficient number of trials S must be evaluated. Let p be the probability that any given
correspondence is valid and P be the probability of success after S trials. The likelihood in
one trial that all k random samples are inliers is $p^k$. Therefore, the likelihood that S such
trials will all fail is

$1 - P = (1 - p^k)^S$ (8.29)


and the required minimum number of trials is


$S = \frac{\log(1 - P)}{\log(1 - p^k)}.$ (8.30)


Stewart (1999) gives examples of the required number of trials S to attain a 99% proba- 
bility of success. As you can see from Table 8.2, the number of trials grows quickly with the 
number of sample points used. This provides a strong incentive to use the minimum number 
of sample points k possible for any given trial, which is how RANSAC is normally used in 


practice. 
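
Equation (8.30) is easy to evaluate directly. The helper below (the name `num_trials` is hypothetical) rounds up to the next whole trial and reproduces the three rows of Table 8.2.

```python
import math

def num_trials(k, p, P=0.99):
    """Minimum number of RANSAC trials S from Equation (8.30).

    k: sample size, p: inlier probability, P: desired probability of success.
    """
    return math.ceil(math.log(1.0 - P) / math.log(1.0 - p ** k))

print(num_trials(3, 0.5), num_trials(6, 0.6), num_trials(6, 0.5))  # 35 97 293
```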


Uncertainty modeling 


In addition to robustly computing a good alignment, some applications require the compu- 
tation of uncertainty (see Appendix B.6). For linear problems, this estimate can be obtained 


by inverting the Hessian matrix (8.9) and multiplying it by the feature position noise, if these 




have not already been used to weight the individual measurements, as in Equations (8.10) 
and (8.11). In statistics, the Hessian, which is the inverse covariance, is sometimes called the 
(Fisher) information matrix (Appendix B.1). 

When the problem involves non-linear least squares, the inverse of the Hessian matrix 
provides the Cramér–Rao lower bound on the covariance matrix, i.e., it provides the minimum
amount of covariance in a given solution, which can actually have a wider spread (“longer 
tails”) if the energy flattens out away from the local minimum where the optimal solution is 
found. 


8.1.5 3D alignment 


Instead of aligning 2D sets of image features, many computer vision applications require the 
alignment of 3D points. In the case where the 3D transformations are linear in the motion 
parameters, e.g., for translation, similarity, and affine, regular least squares (8.5) can be used. 


The case of rigid (Euclidean) motion, 


$E_{\mathrm{R3D}} = \sum_i \|x'_i - R x_i - t\|^2,$ (8.31)


which arises more frequently and is often called the absolute orientation problem (Horn 
1987), requires slightly different techniques. If only scalar weightings are being used (as 
opposed to full 3D per-point anisotropic covariance estimates), the weighted centroids of the 
two point clouds c and c′ can be used to estimate the translation $t = c' - Rc$.¹⁰ We are then
left with the problem of estimating the rotation between two sets of points $\{\hat{x}_i = x_i - c\}$ and
$\{\hat{x}'_i = x'_i - c'\}$ that are both centered at the origin.

One commonly used technique is called the orthogonal Procrustes algorithm (Golub and 
Van Loan 1996, p. 601) and involves computing the singular value decomposition (SVD) of 


the 3 x 3 correlation matrix 


$C = \sum_i \hat{x}'_i \hat{x}_i^T = U \Sigma V^T.$ (8.32)


The rotation matrix is then obtained as $R = U V^T$. (Verify this for yourself when $\hat{x}' = R \hat{x}$.)

Another technique is the absolute orientation algorithm (Horn 1987) for estimating the 
unit quaternion corresponding to the rotation matrix R, which involves forming a 4 x 4 ma- 
trix from the entries in C and then finding the eigenvector associated with its largest positive 


eigenvalue. 
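
The SVD-based orthogonal Procrustes solution (8.32) can be sketched as follows. The function name `absolute_orientation` is hypothetical, unweighted centroids are used, and the determinant check that guards against a reflection is a common safeguard added here rather than something stated in the text.

```python
import numpy as np

def absolute_orientation(x, x_prime):
    """Estimate R, t such that x' is approximately R x + t, from matched (N, 3) arrays."""
    x = np.asarray(x, dtype=float)
    xp = np.asarray(x_prime, dtype=float)
    c, cp = x.mean(axis=0), xp.mean(axis=0)          # centroids of the two point clouds
    C = (xp - cp).T @ (x - c)                        # 3x3 correlation matrix, Equation (8.32)
    U, _, Vt = np.linalg.svd(C)
    # Added safeguard: flip the last axis if U V^T would be a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt
    t = cp - R @ c                                   # translation from the centroids
    return R, t
```

For noise-free, non-degenerate data the determinant is already +1 and the safeguard has no effect.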


10 When full covariances are used, they are transformed by the rotation, so a closed-form solution for translation 


is not possible. 
Lorusso, Eggert, and Fisher (1995) experimentally compare these two techniques to two
additional techniques proposed in the literature, but find that the difference in accuracy is 
negligible (well below the effects of measurement noise). 

In situations where these closed-form algorithms are not applicable, e.g., when full 3D 
covariances are being used or when the 3D alignment is part of some larger optimization, the 
incremental rotation update introduced in Section 2.1.3 (2.35-2.36), which is parameterized 
by an instantaneous rotation vector $\omega$, can be used. (See Section 8.2.3 for an application to
image stitching.)

In some situations, e.g., when merging range data maps, the correspondence between data 
points is not known a priori. In this case, iterative algorithms that start by matching nearby 
points and then update the most likely correspondence can be used (Besl and McKay 1992; 
Zhang 1994; Szeliski and Lavallée 1996; Gold, Rangarajan et al. 1998; David, DeMenthon 
et al. 2004; Li and Hartley 2007; Enqvist, Josephson, and Kahl 2009). These techniques are 


discussed in more detail in Section 13.2.1. 


8.2 Image stitching 


Algorithms for aligning images and stitching them into seamless photo-mosaics are among 
the oldest and most widely used in computer vision (Milgram 1975; Peleg 1981). Image 
stitching algorithms create the high-resolution photo-mosaics used to produce today’s digital 
maps and satellite photos. They are also now a standard mode in smartphone cameras and 
can be used to create beautiful ultra wide-angle panoramas. 

Image stitching originated in the photogrammetry community, where more manually in- 
tensive methods based on surveyed ground control points or manually registered tie points 
have long been used to register aerial photos into large-scale photo-mosaics (Slama 1980). 
One of the key advances in this community was the development of bundle adjustment algo- 
rithms (Section 11.4.2), which could simultaneously solve for the locations of all of the cam- 
era positions, thus yielding globally consistent solutions (Triggs, McLauchlan et al. 1999). 
Another recurring problem in creating photo-mosaics is the elimination of visible seams, for 
which a variety of techniques have been developed over the years (Milgram 1975, 1977; Peleg 
1981; Davis 1998; Agarwala, Dontcheva et al. 2004).

In film photography, special cameras were developed in the 1990s to take ultra-wide- 
angle panoramas, often by exposing the film through a vertical slit as the camera rotated on 
its axis (Meehan 1990). In the mid-1990s, image alignment techniques started being applied 
to the construction of wide-angle seamless panoramas from regular hand-held cameras (Mann 
and Picard 1994; Chen 1995; Szeliski 1996). Subsequent algorithms addressed the need to 
compute globally consistent alignments (Szeliski and Shum 1997; Sawhney and Kumar 1999;
Shum and Szeliski 2000), to remove “ghosts” due to parallax and object movement (Davis 
1998; Shum and Szeliski 2000; Uyttendaele, Eden, and Szeliski 2001; Agarwala, Dontcheva 
et al. 2004), and to deal with varying exposures (Mann and Picard 1994; Uyttendaele, Eden, 
and Szeliski 2001; Levin, Zomet et al. 2004; Eden, Uyttendaele, and Szeliski 2006; Kopf, 
Uyttendaele et al. 2007).¹¹

While early techniques worked by directly minimizing pixel-to-pixel dissimilarities, to- 
day’s algorithms extract a sparse set of features and match them to each other, as described in 
Chapter 7. Such feature-based approaches (Zoghlami, Faugeras, and Deriche 1997; Capel and 
Zisserman 1998; Cham and Cipolla 1998; Badra, Qumsieh, and Dudek 1998; McLauchlan 
and Jaenicke 2002; Brown and Lowe 2007) have the advantage of being more robust against 
scene movement and are usually faster.¹² Their biggest advantage, however, is the ability to
“recognize panoramas”, i.e., to automatically discover the adjacency (overlap) relationships 
among an unordered set of images, which makes them ideally suited for fully automated 
stitching of panoramas taken by casual users (Brown and Lowe 2007). 

What, then, are the essential problems in image stitching? As with image alignment, we 
must first determine the appropriate mathematical model relating pixel coordinates in one 
image to pixel coordinates in another; Section 8.2.1 reviews the basic models we have stud- 
ied and presents some new motion models related specifically to panoramic image stitching. 
Next, we must somehow estimate the correct alignments relating various pairs (or collections) 
of images. Chapter 7 discusses how distinctive features can be found in each image and then 
efficiently matched to rapidly establish correspondences between pairs of images. Chapter 9 
discusses how direct pixel-to-pixel comparisons combined with gradient descent (and other 
optimization techniques) can also be used to estimate these parameters. When multiple im- 
ages exist in a panorama, global optimization techniques can be used to compute a globally 
consistent set of alignments and to efficiently discover which images overlap one another. In 
Section 8.3, we look at how each of these previously developed techniques can be modified 
to take advantage of the imaging setups commonly used to create panoramas. 

Once we have aligned the images, we must choose a final compositing surface for warping 
the aligned images (Section 8.4.1). We also need algorithms to seamlessly cut and blend over- 
lapping images, even in the presence of parallax, lens distortion, scene motion, and exposure 
differences (Sections 8.4.2-8.4.4).


11 A collection of some of these papers was compiled by Benosman and Kang (2001) and they are surveyed by
Szeliski (2006a).

12 See a discussion of the pros and cons of direct vs. feature-based techniques in (Triggs, Zisserman, and Szeliski
2000) and in the first edition of this book (Szeliski 2010, Section 8.3.4).
Figure 8.4 Two-dimensional motion models and how they can be used for image stitching:
(a) translation [2 dof]; (b) affine [6 dof]; (c) perspective [8 dof]; (d) 3D rotation [3+ dof].


8.2.1 Parametric motion models 


Before we can register and align images, we need to establish the mathematical relationships 
that map pixel coordinates from one image to another. A variety of such parametric motion 
models are possible, from simple 2D transforms, to planar perspective models, 3D camera 
rotations, lens distortions, and mapping to non-planar (e.g., cylindrical) surfaces. 

We already covered several of these models in Sections 2.1 and 8.1. In particular, we saw 
in Section 2.1.4 how the parametric motion describing the deformation of a planar surface 
as viewed from different positions can be described with an eight-parameter homography 
(2.71) (Mann and Picard 1994; Szeliski 1996). We also saw how a camera undergoing a pure 
rotation induces a different kind of homography (2.72). 

In this section, we review both of these models and show how they can be applied to dif- 
ferent stitching situations. We also introduce spherical and cylindrical compositing surfaces 
and show how, under favorable circumstances, they can be used to perform alignment using 
pure translations (Section 8.2.6). Deciding which alignment model is most appropriate for a 
given situation or set of data is a model selection problem (Torr 2002; Bishop 2006; Robert 
2007; Hastie, Tibshirani, and Friedman 2009; Murphy 2012), an important topic we do not 
cover in this book. 


Planar perspective motion 


The simplest possible motion model to use when aligning images is to simply translate and 
rotate them in 2D (Figure 8.4a). This is exactly the same kind of motion that you would 
use if you had overlapping photographic prints. It is also the kind of technique favored by 
David Hockney to create the collages that he calls joiners (Zelnik-Manor and Perona 2007; 
Nomura, Zhang, and Nayar 2007). Creating such collages, which show visible seams and 
inconsistencies that add to the artistic effect, is popular on websites such as Flickr, where they 
more commonly go under the name panography (Section 8.1.2). Translation and rotation are
also usually adequate motion models to compensate for small camera motions in applications 
such as photo and video stabilization and merging (Exercise 8.1 and Section 9.2.1). 

In Section 2.1.4, we saw how the mapping between two cameras viewing a common plane
can be described using a $3 \times 3$ homography (2.71). Consider the matrix $\mathbf{M}_{10}$ that arises when
mapping a pixel in one image to a 3D point and then back onto a second image,

$$\tilde{\mathbf{x}}_1 \sim \tilde{\mathbf{P}}_1 \tilde{\mathbf{P}}_0^{-1} \tilde{\mathbf{x}}_0 = \mathbf{M}_{10} \tilde{\mathbf{x}}_0. \tag{8.33}$$

When the last row of the $\mathbf{P}_0$ matrix is replaced with a plane equation $\hat{\mathbf{n}}_0 \cdot \mathbf{p} + c_0$ and points
are assumed to lie on this plane, i.e., their disparity is $d_0 = 0$, we can ignore the last column
of $\mathbf{M}_{10}$ and also its last row, since we do not care about the final z-buffer depth. The resulting
homography matrix $\tilde{\mathbf{H}}_{10}$ (the upper left $3 \times 3$ sub-matrix of $\mathbf{M}_{10}$) describes the mapping
between pixels in the two images,

$$\tilde{\mathbf{x}}_1 \sim \tilde{\mathbf{H}}_{10} \tilde{\mathbf{x}}_0. \tag{8.34}$$


This observation formed the basis of some of the earliest automated image stitching al- 
gorithms (Mann and Picard 1994; Szeliski 1994, 1996). Because reliable feature matching 
techniques had not yet been developed, these algorithms used direct pixel value matching, i.e., 
direct parametric motion estimation, as described in Section 9.2 and Equations (8.19-8.20). 

More recent stitching algorithms first extract features and then match them up, often using 
robust techniques such as RANSAC (Section 8.1.4) to compute a good set of inliers. The final 
computation of the homography (8.34), i.e., the solution of the least squares fitting problem 
given pairs of corresponding features, 

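$$x_1 = \frac{(1 + h_{00}) x_0 + h_{01} y_0 + h_{02}}{h_{20} x_0 + h_{21} y_0 + 1} \quad \text{and} \tag{8.35}$$

$$y_1 = \frac{h_{10} x_0 + (1 + h_{11}) y_0 + h_{12}}{h_{20} x_0 + h_{21} y_0 + 1}, \tag{8.36}$$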


uses iterative least squares, as described in Section 8.1.3 and Equations (8.21-8.23). 
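
The fitting problem in (8.35-8.36) can be sketched in a few lines of NumPy. The block below is a minimal illustration, not the book's implementation: it assumes noise-free correspondences, starts from the identity ($h = 0$), and uses plain Gauss-Newton steps. The function names `apply_homography`, `jacobian`, and `fit_homography` are hypothetical and chosen only for this example.

```python
import numpy as np

def apply_homography(h, pts):
    """Map (n, 2) points with the parameters h = (h00, ..., h21) of (8.35-8.36)."""
    x, y = pts[:, 0], pts[:, 1]
    d = h[6] * x + h[7] * y + 1.0
    x1 = ((1.0 + h[0]) * x + h[1] * y + h[2]) / d
    y1 = (h[3] * x + (1.0 + h[4]) * y + h[5]) / d
    return np.stack([x1, y1], axis=1)

def jacobian(h, pts):
    """(2n, 8) Jacobian of the mapped points with respect to h (x rows, then y rows)."""
    x, y = pts[:, 0], pts[:, 1]
    d = h[6] * x + h[7] * y + 1.0
    m = apply_homography(h, pts)
    zero, one = np.zeros_like(x), np.ones_like(x)
    jx = np.stack([x, y, one, zero, zero, zero, -m[:, 0] * x, -m[:, 0] * y], axis=1)
    jy = np.stack([zero, zero, zero, x, y, one, -m[:, 1] * x, -m[:, 1] * y], axis=1)
    return np.concatenate([jx, jy]) / np.concatenate([d, d])[:, None]

def fit_homography(pts0, pts1, iterations=20):
    """Gauss-Newton least squares fit of h so that apply_homography(h, pts0) ~ pts1."""
    h = np.zeros(8)  # h = 0 corresponds to the identity homography
    for _ in range(iterations):
        r = pts1 - apply_homography(h, pts0)
        residual = np.concatenate([r[:, 0], r[:, 1]])
        dh, *_ = np.linalg.lstsq(jacobian(h, pts0), residual, rcond=None)
        h = h + dh
    return h
```

In practice the correspondences would first be filtered with RANSAC, and pixel coordinates are usually centered and scaled before fitting to improve conditioning.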


8.2.2 Application: Whiteboard and document scanning 


The simplest image-stitching application is to stitch together a number of image scans taken 
on a flatbed scanner. Say you have a large map, or a piece of child’s artwork, that is too large 
to fit on your scanner. Simply take multiple scans of the document, making sure to overlap 
the scans by a large enough amount to ensure that there are enough common features. Next,
take successive pairs of images that you know overlap, extract features, match them up, and
estimate the 2D rigid transform (2.16),

$$\mathbf{x}_{k+1} = \mathbf{R}_k \mathbf{x}_k + \mathbf{t}_k, \tag{8.37}$$

that best matches the features, using two-point RANSAC, if necessary, to find a good set
of inliers. Then, on a final compositing surface (aligned with the first scan, for example),
resample your images (Section 3.6.1) and average them together. Can you see any potential
problems with this scheme?


Figure 8.5 Pure 3D camera rotation. The form of the homography (mapping) is particularly
simple and depends only on the 3D rotation matrix and focal lengths.

One complication is that a 2D rigid transformation is non-linear in the rotation angle $\theta$,
so you will have to either use non-linear least squares or constrain $\mathbf{R}$ to be orthonormal, as
described in Section 8.1.3.
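
One way to keep $\mathbf{R}$ orthonormal is the closed-form orthogonal Procrustes solution based on the SVD. The sketch below assumes this particular method (the text does not prescribe it) and already-matched inlier features; the name `fit_rigid_2d` is hypothetical.

```python
import numpy as np

def fit_rigid_2d(x, y):
    """Least squares rotation R (2x2) and translation t with y ~ R x + t.

    x and y are (n, 2) arrays of corresponding feature locations.
    """
    cx, cy = x.mean(axis=0), y.mean(axis=0)
    # 2x2 cross-covariance between the centered point sets.
    C = (y - cy).T @ (x - cx)
    U, _, Vt = np.linalg.svd(C)
    # Flip the last singular direction if needed so that det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, d]) @ Vt
    t = cy - R @ cx
    return R, t
```

Inside a two-point RANSAC loop, the same function can be called on each random pair of matches, since two correspondences are enough to determine a 2D rigid transform.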

A bigger problem lies in the pairwise alignment process. As you align more and more 
pairs, the solution may drift so that it is no longer globally consistent. In this case, a global op- 
timization procedure, as described in Section 8.3, may be required. Such global optimization 
often requires a large system of non-linear equations to be solved, although in some cases, 
such as linearized homographies (Section 8.2.3) or similarity transforms (Section 8.1.2), reg- 
ular least squares may be an option. 

A slightly more complex scenario is when you take multiple overlapping handheld pic- 
tures of a whiteboard or other large planar object (He and Zhang 2005; Zhang and He 2007). 
Here, the natural motion model to use is a homography, although a more complex model that 
estimates the 3D rigid motion relative to the plane (plus the focal length, if unknown), could
in principle be used.


8.2.3 Rotational panoramas


The most typical case for panoramic image stitching is when the camera undergoes a pure 
rotation. Think of standing at the rim of the Grand Canyon. Relative to the distant geometry 
in the scene, as you snap away, the camera is undergoing a pure rotation, which is equiv- 
alent to assuming that all points are very far from the camera, i.e., on the plane at infinity 
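(Figure 8.5).¹³ Setting $\mathbf{t}_0 = \mathbf{t}_1 = 0$, we get the simplified $3 \times 3$ homography

$$\tilde{\mathbf{H}}_{10} = \mathbf{K}_1 \mathbf{R}_1 \mathbf{R}_0^{-1} \mathbf{K}_0^{-1} = \mathbf{K}_1 \mathbf{R}_{10} \mathbf{K}_0^{-1}, \tag{8.38}$$

where $\mathbf{K}_k = \operatorname{diag}(f_k, f_k, 1)$ is the simplified camera intrinsic matrix (2.59), assuming that
$c_x = c_y = 0$, i.e., we are indexing the pixels starting from the image center (Szeliski 1996).
This can also be re-written as

$$\begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix} \sim \begin{bmatrix} f_1 & & \\ & f_1 & \\ & & 1 \end{bmatrix} \mathbf{R}_{10} \begin{bmatrix} f_0^{-1} & & \\ & f_0^{-1} & \\ & & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \\ 1 \end{bmatrix} \tag{8.39}$$

or

$$\begin{bmatrix} x_1 \\ y_1 \\ f_1 \end{bmatrix} \sim \mathbf{R}_{10} \begin{bmatrix} x_0 \\ y_0 \\ f_0 \end{bmatrix}, \tag{8.40}$$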


which reveals the simplicity of the mapping equations and makes all of the motion parameters 
explicit. Thus, instead of the general eight-parameter homography relating a pair of images, 
we get the three-, four-, or five-parameter 3D rotation motion models corresponding to the 
cases where the focal length f is known, fixed, or variable (Szeliski and Shum 1997).¹⁴
Estimating the 3D rotation matrix (and, optionally, focal length) associated with each image is 
intrinsically more stable than estimating a homography with a full eight degrees of freedom, 
which makes this the method of choice for large-scale image stitching algorithms (Szeliski 
and Shum 1997; Shum and Szeliski 2000; Brown and Lowe 2007). 
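
The equivalence between the homography form (8.38) and the ray form (8.40) is easy to check numerically. The following sketch assumes pixel coordinates measured from the image center; the helper names are hypothetical.

```python
import numpy as np

def intrinsics(f):
    """Simplified calibration matrix K = diag(f, f, 1)."""
    return np.diag([f, f, 1.0])

def rotation_about_y(angle):
    """Rotation by `angle` radians about the vertical (y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rotation_homography(R10, f0, f1):
    """H10 = K1 R10 K0^{-1}, as in (8.38)."""
    return intrinsics(f1) @ R10 @ np.linalg.inv(intrinsics(f0))

def map_pixel_via_homography(H, x0, y0):
    p = H @ np.array([x0, y0, 1.0])
    return p[:2] / p[2]

def map_pixel_via_rays(R10, f0, f1, x0, y0):
    """Rotate the ray (x0, y0, f0) and rescale so its third entry is f1, as in (8.40)."""
    ray = R10 @ np.array([x0, y0, f0])
    return f1 * ray[:2] / ray[2]
```

Both functions return the same pixel location for any rotation for which the rotated ray stays in front of the camera.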

13 In a more general (e.g., indoor) scene, if we want to ensure that there is no parallax (visible relative movement
between objects at different depths), we need to rotate the camera around the lens’s front no-parallax point
(Littlefield 2006). This can be achieved by using a specialized panoramic rotation head with a built-in translation
stage (Houghton 2013) or by determining the front nodal point using observations of collinear points; see Debevec,
Wenger et al. (2002) and Szeliski (2010, Figure 6.7).

14 An initial estimate of the focal lengths can be obtained using the intrinsic calibration techniques described in
Section 11.1.3 or from EXIF tags.


Given this representation, how do we update the rotation matrices to best align two overlapping
images? Given a current estimate for the homography $\tilde{\mathbf{H}}_{10}$ in (8.38), the best way to
update $\mathbf{R}_{10}$ is to prepend an incremental rotation matrix $\mathbf{R}(\boldsymbol{\omega})$ to the current estimate $\mathbf{R}_{10}$
(Szeliski and Shum 1997; Shum and Szeliski 2000),

$$\tilde{\mathbf{H}}(\boldsymbol{\omega}) = \mathbf{K}_1 \mathbf{R}(\boldsymbol{\omega}) \mathbf{R}_{10} \mathbf{K}_0^{-1} = [\mathbf{K}_1 \mathbf{R}(\boldsymbol{\omega}) \mathbf{K}_1^{-1}][\mathbf{K}_1 \mathbf{R}_{10} \mathbf{K}_0^{-1}] = \mathbf{D} \tilde{\mathbf{H}}_{10}. \tag{8.41}$$


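Note that here we have written the update rule in the compositional form, where the incremental
update $\mathbf{D}$ is prepended to the current homography $\tilde{\mathbf{H}}_{10}$. Using the small-angle
approximation to $\mathbf{R}(\boldsymbol{\omega})$ given in (2.35), we can write the incremental update matrix as

$$\mathbf{D} = \mathbf{K}_1 \mathbf{R}(\boldsymbol{\omega}) \mathbf{K}_1^{-1} \approx \mathbf{K}_1 (\mathbf{I} + [\boldsymbol{\omega}]_\times) \mathbf{K}_1^{-1} = \begin{bmatrix} 1 & -\omega_z & f_1 \omega_y \\ \omega_z & 1 & -f_1 \omega_x \\ -\omega_y / f_1 & \omega_x / f_1 & 1 \end{bmatrix}. \tag{8.42}$$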
—Wy/ fi We/ fi 1 
Notice how there is now a nice one-to-one correspondence between the entries in the $\mathbf{D}$
matrix and the $h_{00}, \ldots, h_{21}$ parameters used in Table 8.1 and Equation (8.19), i.e.,

$$(h_{00}, h_{01}, h_{02}, h_{10}, h_{11}, h_{12}, h_{20}, h_{21}) = (0, -\omega_z, f_1 \omega_y, \omega_z, 0, -f_1 \omega_x, -\omega_y / f_1, \omega_x / f_1). \tag{8.43}$$
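We can therefore apply the chain rule to Equations (8.24) and (8.43) to obtain

$$\begin{bmatrix} \hat{x}' - x \\ \hat{y}' - y \end{bmatrix} = \begin{bmatrix} -xy / f_1 & f_1 + x^2 / f_1 & -y \\ -(f_1 + y^2 / f_1) & xy / f_1 & x \end{bmatrix} \begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \end{bmatrix}, \tag{8.44}$$

which gives us the linearized update equations needed to estimate $\boldsymbol{\omega} = (\omega_x, \omega_y, \omega_z)$.¹⁵ Notice
that this update rule depends on the focal length $f_1$ of the target view and is independent
of the focal length $f_0$ of the template view. This is because the compositional algorithm
essentially makes small perturbations to the target. Once the incremental rotation vector $\boldsymbol{\omega}$
has been computed, the $\mathbf{R}_1$ rotation matrix can be updated using $\mathbf{R}_1 \leftarrow \mathbf{R}(\boldsymbol{\omega}) \mathbf{R}_1$.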
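
The compositional update can be sketched as a small Gauss-Newton loop over feature correspondences. This is only an illustration under simplifying assumptions: known focal lengths, noise-free matches, and the relative rotation $\mathbf{R}_{10}$ updated directly rather than $\mathbf{R}_1$. The helper names (`rodrigues`, `warp`, `rotation_jacobian`, `refine_rotation`) are hypothetical.

```python
import numpy as np

def rodrigues(w):
    """Rotation matrix R(w) for the rotation vector w (axis times angle)."""
    theta = np.linalg.norm(w)
    W = np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])
    if theta < 1e-12:
        return np.eye(3) + W  # small-angle approximation I + [w]_x
    return np.eye(3) + np.sin(theta) / theta * W + (1.0 - np.cos(theta)) / theta**2 * (W @ W)

def warp(R10, f0, f1, pts0):
    """Map (n, 2) pixels from image 0 to image 1 using the ray form (8.40)."""
    rays = np.column_stack([pts0, np.full(len(pts0), f0)]) @ R10.T
    return f1 * rays[:, :2] / rays[:, 2:3]

def rotation_jacobian(pts, f1):
    """(2n, 3) Jacobian of (8.44), evaluated at the currently warped points."""
    x, y = pts[:, 0], pts[:, 1]
    jx = np.stack([-x * y / f1, f1 + x**2 / f1, -y], axis=1)
    jy = np.stack([-(f1 + y**2 / f1), x * y / f1, x], axis=1)
    return np.concatenate([jx, jy])

def refine_rotation(R10, f0, f1, pts0, pts1, iterations=10):
    """Repeatedly solve (8.44) for w and prepend R(w) to the current estimate."""
    for _ in range(iterations):
        current = warp(R10, f0, f1, pts0)
        r = pts1 - current
        residual = np.concatenate([r[:, 0], r[:, 1]])
        w, *_ = np.linalg.lstsq(rotation_jacobian(current, f1), residual, rcond=None)
        R10 = rodrigues(w) @ R10
    return R10
```

Because the update is always re-linearized around $\boldsymbol{\omega} = 0$, the Jacobian only needs the current warped coordinates and $f_1$, which matches the observation above that $f_0$ does not appear in the update rule.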

The formulas for updating the focal length estimates are a little more involved and are 
given in Shum and Szeliski (2000). We will not repeat them here, since an alternative up- 
date rule, based on minimizing the difference between back-projected 3D rays, is given in 
Section 8.3.1. Figure 8.1a shows the alignment of four images under the 3D rotation motion
model.


8.2.4 Gap closing 


The techniques presented in this section can be used to estimate a series of rotation matrices
and focal lengths, which can be chained together to create large panoramas. Unfortunately,
because of accumulated errors, this approach will rarely produce a closed 360° panorama.
Instead, there will invariably be either a gap or an overlap (Figure 8.6).


15 This is the same as the rotational component of instantaneous rigid flow (Bergen, Anandan et al. 1992) and the
update equations given by Szeliski and Shum (1997) and Shum and Szeliski (2000).


Figure 8.6 Gap closing (Szeliski and Shum 1997) © 1997 ACM: (a) A gap is visible when
the focal length is wrong (f = 510). (b) No gap is visible for the correct focal length (f =
468).


We can solve this problem by matching the first image in the sequence with the last one. 
The difference between the two rotation matrix estimates associated with the repeated first 
image indicates the amount of misregistration. This error can be distributed evenly across the 
whole sequence by taking the quotient of the two quaternions associated with these rotations 
and dividing this “error quaternion” by the number of images in the sequence (Szeliski and 
Shum 1997). We can also update the estimated focal length based on the amount of misregistration.
To do this, we first convert the error quaternion into a gap angle, $\theta_g$, and then update
the focal length using the equation $f' = f(1 - \theta_g / 360^\circ)$.


Figure 8.6a shows the end of a registered image sequence and the first image. There is a
big gap between the last image and the first, which are in fact the same image. The gap is
32° because an incorrect estimate of the focal length (f = 510) was used. Figure 8.6b shows the
registration after closing the gap with the correct focal length (f = 468). Notice that both 
mosaics show very little visual misregistration (except at the gap), yet Figure 8.6a has been 
computed using a focal length that has 9% error. Related approaches have been developed by 
Hartley (1994b), McMillan and Bishop (1995), Stein (1995), and Kang and Weiss (1997) to 
solve the focal length estimation problem using pure panning motion and cylindrical images. 
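
The gap-closing arithmetic can be sketched as follows. The block assumes a pure panning sequence, so that the error rotation is about the vertical axis and dividing the "error quaternion" by the number of images reduces to dividing the gap angle; that reduction, and the function names, are assumptions made for this example.

```python
def update_focal_length(f, gap_angle_deg):
    """One application of f' = f (1 - theta_g / 360 degrees)."""
    return f * (1.0 - gap_angle_deg / 360.0)

def distribute_gap(gap_angle_deg, num_images):
    """Cumulative pan correction (degrees) for images 1..n so the gap closes evenly.

    Image k receives k equal increments, so the last image absorbs the full gap.
    """
    step = gap_angle_deg / num_images
    return [k * step for k in range(1, num_images + 1)]
```

With the numbers quoted for Figure 8.6, `update_focal_length(510, 32)` gives roughly 464.7, which is close to, though not identical to, the reported correct value of 468. The text does not say whether the update was iterated with re-registration, so the small difference is left unexplained here.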


Unfortunately, this gap-closing heuristic only works for the kind of “one-dimensional” 
panorama where the camera is continuously turning in the same direction. In Section 8.3, we 
describe a different approach to removing gaps and overlaps that works for arbitrary camera
motions.


Figure 8.7 Video stitching the background scene to create a single sprite image that can
be transmitted and used to re-create the background in each frame (Lee, Chen et al. 1997) ©
1997 IEEE.


8.2.5 Application: Video summarization and compression 


An interesting application of image stitching is the ability to summarize and compress videos 
taken with a panning camera. This application was first suggested by Teodosio and Ben- 
der (1993), who called their mosaic-based summaries salient stills. These ideas were then 
extended by Irani, Hsu, and Anandan (1995) and Irani and Anandan (1998) to additional 
applications, such as video compression and video indexing. While these early approaches 
used affine motion models and were therefore restricted to long focal lengths, the techniques 
were generalized by Lee, Chen et al. (1997) to full eight-parameter homographies and incor- 
porated into the MPEG-4 video compression standard, where the stitched background layers 
were called video sprites (Figure 8.7). 

While video stitching is in many ways a straightforward generalization of multiple-image 
stitching (Steedly, Pal, and Szeliski 2005; Baudisch, Tan et al. 2006), the potential presence 
of large amounts of independent motion, camera zoom, and the desire to visualize dynamic 
events impose additional challenges. For example, moving foreground objects can often be 
removed using median filtering. Alternatively, foreground objects can be extracted into a sep- 
arate layer (Sawhney and Ayer 1996) and later composited back into the stitched panoramas, 
sometimes as multiple instances to give the impression of a “Chronophotograph” (Massey
and Bender 1996) and sometimes as video overlays (Irani and Anandan 1998). Videos can 
also be used to create animated panoramic video textures (Section 14.5.2), in which different 
portions of a panoramic scene are animated with independently moving video loops (Agarwala,
Zheng et al. 2005; Rav-Acha, Pritch et al. 2005; Joshi, Mehta et al. 2012; Yan, Liu,
and Furukawa 2017; He, Liao et al. 2017; Oh, Joo et al. 2017), or to shine “video flashlights” 
onto a composite mosaic of a scene (Sawhney, Arpa et al. 2002). 

Video can also provide an interesting source of content for creating panoramas taken from 
moving cameras. While this invalidates the usual assumption of a single point of view (opti- 
cal center), interesting results can still be obtained. For example, the VideoBrush system of 
Sawhney, Kumar et al. (1998) uses thin strips taken from the center of the image to create a 
panorama taken from a horizontally moving camera. This idea can be generalized to other 
camera motions and compositing surfaces using the concept of mosaics on an adaptive mani- 
fold (Peleg, Rousso et al. 2000), and also used to generate panoramic stereograms (Ishiguro, 
Yamamoto, and Tsuji 1992; Peleg, Ben-Ezra, and Pritch 2001).¹⁶ Related ideas have been
used to create panoramic matte paintings for multiplane cel animation (Wood, Finkelstein et
al. 1997), for creating stitched images of scenes with parallax (Kumar, Anandan et al. 1995),
and as 3D representations of more complex scenes using multiple-center-of-projection im- 
ages (Rademacher and Bishop 1998) and multi-perspective panoramas (Román, Garg, and 
Levoy 2004; Román and Lensch 2006; Agarwala, Agrawala et al. 2006; Kopf, Chen et al. 
2010). 

Another interesting variant on video-based panoramas is concentric mosaics (Section 
14.3.3) (Shum and He 1999). Here, rather than trying to produce a single panoramic image, 
the complete original video is kept and used to re-synthesize views (from different camera 
origins) using ray remapping (light field rendering), thus endowing the panorama with a sense 
of 3D depth. The same dataset can also be used to explicitly reconstruct the depth using multi- 
baseline stereo (Ishiguro, Yamamoto, and Tsuji 1992; Peleg, Ben-Ezra, and Pritch 2001; Li, 
Shum et al. 2004; Zheng, Kang et al. 2007). 


8.2.6 Cylindrical and spherical coordinates 


An alternative to using homographies or 3D motions to align images is to first warp the images 
into cylindrical coordinates and then use a pure translational model to align them (Chen 1995; 
Szeliski 1996). Unfortunately, this only works if the images are all taken with a level camera 
or with a known tilt angle. 

Assume for now that the camera is in its canonical position, i.e., its rotation matrix is the 
identity, R = I, so that the optical axis is aligned with the z-axis and the y-axis is aligned 
vertically. The 3D ray corresponding to an (x, y) pixel is therefore (x, y, f). 


16 A similar technique was likely used in the Google Cardboard Camera, https://blog.google/products/google-vr/cardboard-camera-ios.


Figure 8.8 Projection from 3D to (a) cylindrical and (b) spherical coordinates.


We wish to project this image onto a cylindrical surface of unit radius (Szeliski 1996). 
Points on this surface are parameterized by an angle $\theta$ and a height $h$, with the 3D cylindrical
coordinates corresponding to $(\theta, h)$ given by

$$(\sin\theta, h, \cos\theta) \propto (x, y, f), \tag{8.45}$$


as shown in Figure 8.8a. From this correspondence, we can compute the formula for the
warped or mapped coordinates (Szeliski and Shum 1997),

$$x' = s\theta = s \tan^{-1} \frac{x}{f}, \tag{8.46}$$

$$y' = sh = s \frac{y}{\sqrt{x^2 + f^2}}, \tag{8.47}$$

where $s$ is an arbitrary scaling factor (sometimes called the radius of the cylinder) that can be
set to $s = f$ to minimize the distortion (scaling) near the center of the image.¹⁷ The inverse
of this mapping equation is given by

$$x = f \tan\theta = f \tan\frac{x'}{s}, \tag{8.48}$$

$$y = h \sqrt{x^2 + f^2} = \frac{y'}{s} f \sqrt{1 + \tan^2 (x'/s)} = f \frac{y'}{s} \sec\frac{x'}{s}. \tag{8.49}$$

Images can also be projected onto a spherical surface (Szeliski and Shum 1997), which 
is useful if the final panorama includes a full sphere or hemisphere of views, instead of just 
a cylindrical strip. In this case, the sphere is parameterized by two angles $(\theta, \phi)$, with 3D
spherical coordinates given by

$$(\sin\theta \cos\phi, \sin\phi, \cos\theta \cos\phi) \propto (x, y, f), \tag{8.50}$$

as shown in Figure 8.8b.¹⁸ The correspondence between coordinates is now given by (Szeliski
and Shum 1997):

$$x' = s\theta = s \tan^{-1} \frac{x}{f}, \tag{8.51}$$

$$y' = s\phi = s \tan^{-1} \frac{y}{\sqrt{x^2 + f^2}}, \tag{8.52}$$

while the inverse is given by

$$x = f \tan\theta = f \tan\frac{x'}{s}, \tag{8.53}$$

$$y = \sqrt{x^2 + f^2} \tan\phi = \tan\frac{y'}{s} f \sqrt{1 + \tan^2 (x'/s)} = f \tan\frac{y'}{s} \sec\frac{x'}{s}. \tag{8.54}$$


17 The scale can also be set to a larger or smaller value for the final compositing surface, depending on the desired
output panorama resolution; see Section 8.4.


Figure 8.9 A cylindrical panorama (Szeliski and Shum 1997) © 1997 ACM: (a) two cylindrically
warped images related by a horizontal translation; (b) part of a cylindrical panorama
composited from a sequence of images.


Note that it may be simpler to generate a scaled (x,y,z) direction from Equation (8.50) 
followed by a perspective division by z and a scaling by f. 
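
The spherical mapping, with the inverse computed through a 3D direction as suggested above, can be sketched like this. The function names are hypothetical, and the image center is again assumed to be the coordinate origin.

```python
import numpy as np

def to_spherical(x, y, f, s):
    """Forward warp (8.51-8.52): image pixel (x, y) to spherical (x', y')."""
    xp = s * np.arctan2(x, f)
    yp = s * np.arctan2(y, np.hypot(x, f))
    return xp, yp

def from_spherical(xp, yp, f, s):
    """Inverse warp via a scaled (x, y, z) direction from (8.50)."""
    theta, phi = xp / s, yp / s
    dx = np.sin(theta) * np.cos(phi)
    dy = np.sin(phi)
    dz = np.cos(theta) * np.cos(phi)
    # Perspective division by z followed by scaling by f.
    return f * dx / dz, f * dy / dz
```

The result agrees with the closed forms (8.53-8.54), since $dx/dz = \tan\theta$ and $dy/dz = \tan\phi \sec\theta$.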

Cylindrical image stitching algorithms are most commonly used when the camera is 
known to be level and only rotating around its vertical axis (Chen 1995). Under these conditions,
images at different rotations are related by a pure horizontal translation.¹⁹ This makes
it attractive as an initial class project in an introductory computer vision course, since the 
full complexity of the perspective alignment algorithm (Sections 8.1, 9.2, and 8.2.3) can be 
avoided. Figure 8.9 shows how two cylindrically warped images from a leveled rotational 
panorama are related by a pure translation (Szeliski and Shum 1997). 
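
The pure-translation property can be verified numerically: rotating a ray about the vertical axis by an angle $\alpha$ shifts its cylindrical $x'$ coordinate by $s\alpha$ and leaves $y'$ unchanged. The sketch below assumes a perfectly level camera and uses hypothetical helper names.

```python
import numpy as np

def ray_to_cylinder(ray, s):
    """Project a 3D ray (X, Y, Z) onto a cylinder of radius s around the y-axis."""
    X, Y, Z = ray
    return s * np.arctan2(X, Z), s * Y / np.hypot(X, Z)

def rotation_about_y(alpha):
    """Rotation by alpha radians about the vertical (y) axis."""
    c, sn = np.cos(alpha), np.sin(alpha)
    return np.array([[c, 0.0, sn], [0.0, 1.0, 0.0], [-sn, 0.0, c]])

def cylindrical_shift(x, y, f, s, alpha):
    """Change in cylindrical coordinates of pixel (x, y) after panning by alpha."""
    ray = np.array([x, y, f])
    x0, y0 = ray_to_cylinder(ray, s)
    x1, y1 = ray_to_cylinder(rotation_about_y(alpha) @ ray, s)
    return x1 - x0, y1 - y0
```

For every pixel the returned shift is $(s\alpha, 0)$, as long as the pan does not wrap past $\pm 180°$.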


18 Note that these are not the usual spherical coordinates, first presented in Equation (2.8). Here, the y-axis points
at the north pole instead of the z-axis, since we are used to viewing images taken horizontally, i.e., with the y-axis
pointing in the direction of the gravity vector.

19 Small vertical tilts can sometimes be compensated for with vertical translations.
Professional panoramic photographers often use pan-tilt heads that make it easy to control
the tilt and to stop at specific detents in the rotation angle. Motorized rotation heads are also
sometimes used for the acquisition of larger panoramas (Kopf, Uyttendaele et al. 2007).²⁰
Not only do they ensure a uniform coverage of the visual field with a desired amount of 
image overlap but they also make it possible to stitch the images using cylindrical or spherical 
coordinates and pure translations. In this case, pixel coordinates (x, y, f) must first be rotated 
using the known tilt and panning angles before being projected into cylindrical or spherical 
coordinates (Chen 1995). Having a roughly known panning angle also makes it easier to 
compute the alignment, as the rough relative positioning of all the input images is known 
ahead of time, enabling a reduced search range for alignment. Figure 8.1b shows a full 3D 
rotational panorama unwrapped onto the surface of a sphere (Szeliski and Shum 1997). 

One final coordinate mapping worth mentioning is the polar mapping, where the north
pole lies along the optical axis rather than the vertical axis,

$$(\cos\theta \sin\phi, \sin\theta \sin\phi, \cos\phi) = s\,(x, y, z). \tag{8.55}$$

In this case, the mapping equations become

$$x' = s\phi \cos\theta = s \frac{x}{r} \tan^{-1} \frac{r}{z}, \tag{8.56}$$

$$y' = s\phi \sin\theta = s \frac{y}{r} \tan^{-1} \frac{r}{z}, \tag{8.57}$$

where $r = \sqrt{x^2 + y^2}$ is the radial distance in the $(x, y)$ plane and $s\phi$ plays a similar role
in the $(x', y')$ plane. This mapping provides an attractive visualization surface for certain
kinds of wide-angle panoramas and is also a good model for the distortion induced by fisheye
lenses, as discussed in Section 2.1.5. Note how for small values of $(x, y)$, the mapping
equations reduce to $x' \approx s x / z$, which suggests that $s$ plays a role similar to the focal length
$f$.
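
A minimal sketch of the polar mapping (8.56-8.57) follows. The only subtlety is the point on the optical axis, where $r = 0$ and the ratio $\tan^{-1}(r/z)/r$ must be replaced by its limit $1/z$; the function name `to_polar` and this handling are choices made for the example.

```python
import numpy as np

def to_polar(x, y, z, s):
    """Polar mapping (8.56-8.57) of a 3D direction (x, y, z), assuming z > 0."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    r = np.hypot(x, y)
    phi = np.arctan2(r, z)  # angle away from the optical axis
    safe_r = np.where(r > 0, r, 1.0)
    # phi / r tends to 1 / z as r goes to 0 (point on the optical axis).
    ratio = np.where(r > 0, phi / safe_r, 1.0 / z)
    return s * x * ratio, s * y * ratio
```

The distance of the mapped point from the origin equals $s\phi$, which is the equidistant behavior commonly used to model fisheye lenses.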


8.3 Global alignment 


So far, we have discussed how to register pairs of images using a variety of motion models. In 
most applications, we are given more than a single pair of images to register. The goal is then 
to find a globally consistent set of alignment parameters that minimize the misregistration 
between all pairs of images (Szeliski and Shum 1997; Shum and Szeliski 2000; Sawhney and 
Kumar 1999; Coorg and Teller 2000). 


20 See also https://gigapan.org.


In this section, we extend the pairwise matching criteria (8.2, 9.1, and 9.43) to a global
energy function that involves all of the per-image pose parameters (Section 8.3.1). Once we 
have computed the global alignment, we often need to perform local adjustments, such as 
parallax removal, to reduce double images and blurring due to local misregistrations (Sec- 
tion 8.3.2). Finally, if we are given an unordered set of images to register, we need to discover 
which images go together to form one or more panoramas. This process of panorama recognition
is described in Section 8.3.3.


8.3.1 Bundle adjustment 


One way to register a large number of images is to add new images to the panorama one 
at a time, aligning the most recent image with the previous ones already in the collection 
(Szeliski and Shum 1997) and discovering, if necessary, which images it overlaps (Sawhney 
and Kumar 1999). In the case of 360° panoramas, accumulated error may lead to the presence 
of a gap (or excessive overlap) between the two ends of the panorama, which can be fixed by 
stretching the alignment of all the images using a process called gap closing (Section 8.2.4). 
However, a better alternative is to simultaneously align all the images using a least-squares
framework to correctly distribute any misregistration errors.


The process of simultaneously adjusting pose parameters and 3D point locations for a 
large collection of overlapping images is called bundle adjustment in the photogrammetry 
community (Triggs, McLauchlan et al. 1999). In computer vision, it was first applied to the 
general structure from motion problem (Szeliski and Kang 1994) and then later specialized 
for panoramic image stitching (Shum and Szeliski 2000; Sawhney and Kumar 1999; Coorg 
and Teller 2000). 


In this section, we formulate the problem of global alignment using a feature-based ap- 
proach, since this results in a simpler system. An equivalent direct approach can be obtained 
either by dividing images into patches and creating a virtual feature correspondence for each 
one (Shum and Szeliski 2000) or by replacing the per-feature error metrics with per-pixel
metrics (Irani and Anandan 1999).


Before we describe this in more detail, we should mention that a simpler, although less
accurate, approach is to compute pairwise rotation estimates between overlapping images,
and to then use a rotation averaging approach to estimate a global rotation for each camera
(Hartley, Trumpf et al. 2013). However, since this approach, unlike bundle adjustment, does not
correctly account for the measurement errors in each feature point location, the solution will
not have the same theoretical optimality.
Consider the feature-based alignment problem given in Equation (8.2), i.e.,

$$E_{\text{pairwise-LS}} = \sum_i \|\mathbf{r}_i\|^2 = \sum_i \|\tilde{\mathbf{x}}'_i(\mathbf{x}_i; \mathbf{p}) - \hat{\mathbf{x}}'_i\|^2. \tag{8.58}$$

For multi-image alignment, instead of having a single collection of pairwise feature correspondences,
$\{(\mathbf{x}_i, \hat{\mathbf{x}}'_i)\}$, we have a collection of $n$ features, with the location of the $i$th feature
point in the $j$th image denoted by $\mathbf{x}_{ij}$ and its scalar confidence (i.e., inverse variance) denoted
by $c_{ij}$.²¹ Each image also has some associated pose parameters.

In this section, we assume that this pose consists of a rotation matrix $\mathbf{R}_j$ and a focal
length $f_j$, although formulations in terms of homographies are also possible (Szeliski and
Shum 1997; Sawhney and Kumar 1999). The equation mapping a 3D point $\mathbf{x}_i$ into a point
$\mathbf{x}_{ij}$ in frame $j$ can be re-written from Equations (2.68) and (8.38) as

$$\tilde{\mathbf{x}}_{ij} \sim \mathbf{K}_j \mathbf{R}_j \mathbf{x}_i \quad \text{and} \quad \mathbf{x}_i \sim \mathbf{R}_j^{-1} \mathbf{K}_j^{-1} \tilde{\mathbf{x}}_{ij}, \tag{8.59}$$

where $\mathbf{K}_j = \operatorname{diag}(f_j, f_j, 1)$ is the simplified form of the calibration matrix. The motion
mapping a point $\mathbf{x}_{ij}$ from frame $j$ into a point $\mathbf{x}_{ik}$ in frame $k$ is similarly given by

$$\tilde{\mathbf{x}}_{ik} \sim \tilde{\mathbf{H}}_{kj} \tilde{\mathbf{x}}_{ij} = \mathbf{K}_k \mathbf{R}_k \mathbf{R}_j^{-1} \mathbf{K}_j^{-1} \tilde{\mathbf{x}}_{ij}. \tag{8.60}$$

Given an initial set of $\{(\mathbf{R}_j, f_j)\}$ estimates obtained from chaining pairwise alignments, how
do we refine these estimates?

One approach is to directly extend the pairwise energy $E_{\text{pairwise-LS}}$ (8.58) to a multiview
formulation,

$$E_{\text{all-pairs-2D}} = \sum_i \sum_{jk} c_{ij} c_{ik} \|\tilde{\mathbf{x}}_{ik}(\hat{\mathbf{x}}_{ij}; \mathbf{R}_j, f_j, \mathbf{R}_k, f_k) - \hat{\mathbf{x}}_{ik}\|^2, \tag{8.61}$$

where the $\tilde{\mathbf{x}}_{ik}$ function is the predicted location of feature $i$ in frame $k$ given by (8.60),
$\hat{\mathbf{x}}_{ij}$ is the observed location, and the “2D” in the subscript indicates that an image-plane
error is being minimized (Shum and Szeliski 2000). Note that since $\tilde{\mathbf{x}}_{ik}$ depends on the $\hat{\mathbf{x}}_{ij}$
observed value, we actually have an errors-in-variable problem, which in principle requires
more sophisticated techniques than least squares to solve (Van Huffel and Lemmerling 2002;
Matei and Meer 2006). However, in practice, if we have enough features, we can directly
minimize the above quantity using regular non-linear least squares and obtain an accurate
multi-frame alignment.
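
To make the structure of (8.61) concrete, the following sketch evaluates the energy for a given set of poses. It assumes the sum over $jk$ runs over ordered pairs of distinct images in which feature $i$ is observed, and stores observations in nested dictionaries; both the data layout and the function names are assumptions of this example, not something prescribed by the text.

```python
import numpy as np
from itertools import permutations

def intrinsics(f):
    """Simplified calibration matrix K = diag(f, f, 1)."""
    return np.diag([f, f, 1.0])

def project(X, R_j, f_j):
    """Project the 3D point (ray) X into image j, as in the first half of (8.59)."""
    p = intrinsics(f_j) @ R_j @ X
    return p[:2] / p[2]

def transfer(x_ij, R_j, f_j, R_k, f_k):
    """Predict the location in image k of a feature observed at x_ij in image j (8.60)."""
    H_kj = intrinsics(f_k) @ R_k @ R_j.T @ np.linalg.inv(intrinsics(f_j))
    p = H_kj @ np.array([x_ij[0], x_ij[1], 1.0])
    return p[:2] / p[2]

def energy_all_pairs_2d(obs, conf, R, f):
    """Evaluate (8.61). obs[i][j] is the observed 2D location of feature i in image j,
    conf[i][j] is its confidence c_ij; unobserved features are simply absent."""
    total = 0.0
    for i, views in obs.items():
        for j, k in permutations(views, 2):
            predicted = transfer(views[j], R[j], f[j], R[k], f[k])
            total += conf[i][j] * conf[i][k] * np.sum((predicted - views[k]) ** 2)
    return total
```

A non-linear least squares solver would minimize this quantity over the rotations and focal lengths; here the function only evaluates the cost so that candidate poses can be compared.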

While this approach works pretty well, it suffers from two potential disadvantages. First,
because a summation is taken over all pairs with corresponding features, features that are
observed many times are overweighted in the final solution. (In effect, a feature observed $m$
times gets counted $\binom{m}{2}$ times instead of $m$ times.) Second, the derivatives of $\tilde{\mathbf{x}}_{ik}$ with respect
to the $\{(\mathbf{R}_j, f_j)\}$ are a little cumbersome, although using the incremental correction to $\mathbf{R}_j$
introduced in Section 8.2.3 makes this more tractable.


21 Features that are not seen in image $j$ have $c_{ij} = 0$. We can also use $2 \times 2$ inverse covariance matrices $\boldsymbol{\Sigma}_{ij}^{-1}$ in
place of $c_{ij}$, as shown in Equation (8.11).

An alternative way to formulate the optimization is to use true bundle adjustment, i.e., to
solve not only for the pose parameters $\{(R_j, f_j)\}$ but also for the 3D point positions $\{x_i\}$,

$$E_{\text{BA-2D}} = \sum_i \sum_j c_{ij} \| \tilde{x}_{ij}(x_i; R_j, f_j) - \hat{x}_{ij} \|^2, \tag{8.62}$$

where $\tilde{x}_{ij}(x_i; R_j, f_j)$ is given by (8.59). The disadvantage of full bundle adjustment is that
there are more variables to solve for, so each iteration and also the overall convergence may
be slower. (Imagine how the 3D points need to “shift” each time some rotation matrices are
updated.) However, the computational complexity of each linearized Gauss-Newton step can
be reduced using sparse matrix techniques (Section 11.4.3) (Szeliski and Kang 1994; Triggs,
McLauchlan et al. 1999; Hartley and Zisserman 2004).

An alternative formulation is to minimize the error in 3D projected ray directions (Shum
and Szeliski 2000), i.e.,

$$E_{\text{BA-3D}} = \sum_i \sum_j c_{ij} \| \tilde{x}_i(\hat{x}_{ij}; R_j, f_j) - x_i \|^2, \tag{8.63}$$

where $\tilde{x}_i(\hat{x}_{ij}; R_j, f_j)$ is given by the second half of (8.59). This has no particular advantage
over (8.62). In fact, since errors are being minimized in 3D ray space, there is a bias towards
estimating longer focal lengths, since the angles between rays become smaller as f increases.

However, if we eliminate the 3D rays $x_i$, we can derive a pairwise energy formulated in
3D ray space (Shum and Szeliski 2000),

$$E_{\text{all-pairs-3D}} = \sum_i \sum_{jk} c_{ij} c_{ik} \| \tilde{x}_i(\hat{x}_{ij}; R_j, f_j) - \tilde{x}_i(\hat{x}_{ik}; R_k, f_k) \|^2. \tag{8.64}$$

This results in the simplest set of update equations (Shum and Szeliski 2000), since the $f_k$ can
be folded into the creation of the homogeneous coordinate vector as in Equation (8.40). Thus,
even though this formula over-weights features that occur more frequently, it is the method
used by Shum and Szeliski (2000) and Brown, Szeliski, and Winder (2005). To reduce the
bias towards longer focal lengths, we multiply each residual (3D error) by $\sqrt{f_j f_k}$, which is
similar to projecting the 3D rays into a “virtual camera” of intermediate focal length.
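To make the pairwise 3D energy (8.64) and the $\sqrt{f_j f_k}$ weighting concrete, here is a minimal sketch that evaluates it for a toy two-image configuration. It assumes that rays are normalized to unit length and that each unordered image pair is counted once; the function and variable names are illustrative, not taken from the cited systems.

```python
import numpy as np

def rot_y(theta):
    """Rotation about the y-axis (a pure pan), used to build the toy poses."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(x, R, f):
    """Image-plane location of the 3D point x, following Eq. (8.59)."""
    p = R @ x
    return f * p[:2] / p[2]

def back_project(x_obs, R, f):
    """Unit ray along R^{-1} K^{-1} (x, y, 1), which is proportional to R^T (x, y, f)."""
    ray = R.T @ np.array([x_obs[0], x_obs[1], f])
    return ray / np.linalg.norm(ray)

def all_pairs_3d_energy(obs, conf, rotations, focals):
    """Evaluate Eq. (8.64); obs[i][j] is feature i in image j, or None if unseen."""
    energy = 0.0
    n_images = len(rotations)
    for i, feature in enumerate(obs):
        for j in range(n_images):
            for k in range(j + 1, n_images):   # assumption: each pair counted once
                if feature[j] is None or feature[k] is None:
                    continue
                r_j = back_project(feature[j], rotations[j], focals[j])
                r_k = back_project(feature[k], rotations[k], focals[k])
                w = np.sqrt(focals[j] * focals[k])   # focal-length bias correction
                energy += conf[i][j] * conf[i][k] * np.sum((w * (r_j - r_k)) ** 2)
    return energy

points = [np.array([0.1, 0.05, 1.0]), np.array([-0.2, 0.1, 1.0])]
rotations = [rot_y(0.0), rot_y(0.1)]
focals = [500.0, 500.0]
obs = [[project(x, R, f) for R, f in zip(rotations, focals)] for x in points]
conf = [[1.0, 1.0], [1.0, 1.0]]

e_true = all_pairs_3d_energy(obs, conf, rotations, focals)
e_bad = all_pairs_3d_energy(obs, conf, [rot_y(0.0), rot_y(0.12)], focals)
print(e_true, e_bad)
```

With the poses that generated the observations the energy is numerically zero, while perturbing one rotation produces a clearly positive energy, which is what a non-linear least squares solver would drive back down.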


Up vector selection. As mentioned above, there exists a global ambiguity in the pose of the
3D cameras computed by the above methods. While this may not appear to matter, people
prefer that the final stitched image is “upright” rather than twisted or tilted. More concretely,
people are used to seeing photographs displayed so that the vertical (gravity) axis points 
straight up in the image. Consider how you usually shoot photographs: while you may pan 
and tilt the camera any which way, you usually keep the horizontal edge of your camera (its 
x-axis) parallel to the ground plane (perpendicular to the world gravity direction). 
Mathematically, this constraint on the rotation matrices can be expressed as follows. Recall
from Equation (8.59) that the 3D to 2D projection is given by

$$\tilde{x}_{ik} \sim K_k R_k x_i. \tag{8.65}$$

We wish to post-multiply each rotation matrix $R_k$ by a global rotation $R_G$ such that the
projection of the global y-axis, $\hat{j} = (0, 1, 0)$, is perpendicular to the image x-axis, $\hat{i} = (1, 0, 0)$.²²
This constraint can be written as

$$\hat{i}^T R_k R_G \hat{j} = 0 \tag{8.66}$$

(note that the scaling by the calibration matrix is irrelevant here). This is equivalent to
requiring that the first row of $R_k$, $r_{k0} = \hat{i}^T R_k$, be perpendicular to the second column of $R_G$,
$r_{G1} = R_G \hat{j}$. This set of constraints (one per input image) can be written as a least squares
problem,

$$r_{G1} = \arg\min_r \sum_k (r^T r_{k0})^2 = \arg\min_r r^T \Big[ \sum_k r_{k0} r_{k0}^T \Big] r. \tag{8.67}$$

Thus, $r_{G1}$ is the eigenvector associated with the smallest eigenvalue of the scatter or moment
matrix spanned by the individual camera rotation x-vectors, which should generally be of the
form (c, 0, s) when the cameras are upright.

To fully specify the $R_G$ global rotation, we need to specify one additional constraint.
This is related to the view selection problem discussed in Section 8.4.1. One simple heuristic
is to prefer the average z-axis of the individual rotation matrices, $\bar{k} = \sum_k \hat{k}^T R_k$, to be close
to the world z-axis, $r_{G2} = R_G \hat{k}$. We can therefore compute the full rotation matrix $R_G$ in
three steps:

1. $r_{G1} = \text{min eigenvector}\,(\sum_k r_{k0} r_{k0}^T)$;
2. $r_{G0} = \mathcal{N}((\sum_k r_{k2}) \times r_{G1})$;
3. $r_{G2} = r_{G0} \times r_{G1}$,

where $\mathcal{N}(v) = v / \|v\|$ normalizes a vector v.


22 Note that here we use the convention common in computer graphics that the vertical world axis corresponds to
y. This is a natural choice if we wish the rotation matrix associated with a “regular” image taken horizontally to be
the identity, rather than a 90° rotation around the x-axis.
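The three-step up vector computation can be sketched as follows. This is a minimal illustration under stated assumptions: rotations map world to camera coordinates, so the rows of $R_k$ are the camera axes, and the sign of the eigenvector (which the three steps leave ambiguous) is resolved with an extra heuristic that is not in the text. All function names are hypothetical.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def global_up_rotation(rotations):
    """Compute R_G from a list of 3x3 camera rotations R_k using the three steps."""
    r_k0 = np.array([R[0] for R in rotations])   # camera x-axes (first rows)
    r_k1 = np.array([R[1] for R in rotations])   # camera y-axes (second rows)
    r_k2 = np.array([R[2] for R in rotations])   # camera z-axes (third rows)

    scatter = r_k0.T @ r_k0                      # sum_k r_k0 r_k0^T
    _, eigvecs = np.linalg.eigh(scatter)         # eigenvalues in ascending order
    r_g1 = eigvecs[:, 0]                         # step 1: min eigenvector
    # Assumption (not in the text): fix the eigenvector sign so that the camera
    # y-axes agree, on average, with the recovered up vector.
    if r_k1.sum(axis=0) @ r_g1 < 0:
        r_g1 = -r_g1

    r_g0 = normalize(np.cross(r_k2.sum(axis=0), r_g1))   # step 2
    r_g2 = np.cross(r_g0, r_g1)                           # step 3
    return np.column_stack([r_g0, r_g1, r_g2])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Upright cameras (tilt after pan), then an unknown global twist of the world frame.
upright = [rot_x(t) @ rot_y(p) for t, p in [(0.1, -0.4), (-0.05, 0.0), (0.2, 0.5)]]
twist = rot_z(0.3) @ rot_x(0.2)
twisted = [R @ twist for R in upright]

R_G = global_up_rotation(twisted)
residual = max(abs((R @ R_G)[0, 1]) for R in twisted)   # Eq. (8.66) per image
print(residual)
```

After post-multiplying by the recovered $R_G$, the constraint (8.66) holds for every image up to rounding error, and $R_G$ is a proper rotation.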
8.3.2 Parallax removal


Once we have optimized the global orientations and focal lengths of our cameras, we may find 
that the images are still not perfectly aligned, i.e., the resulting stitched image looks blurry 
or ghosted in some places. This can be caused by a variety of factors, including unmodeled 
radial distortion, 3D parallax (failure to rotate the camera around its front nodal point), small 
scene motions such as waving tree branches, and large-scale scene motions such as people 
moving in and out of pictures. 

Each of these problems can be treated with a different approach. Radial distortion can be 
estimated (potentially ahead of time) using one of the techniques discussed in Section 2.1.5. 
For example, the plumb-line method (Brown 1971; Kang 2001; El-Melegy and Farag 2003) 
adjusts radial distortion parameters until slightly curved lines become straight, while mosaic- 
based approaches adjust them until misregistration is reduced in image overlap areas (Stein 
1997; Sawhney and Kumar 1999). 

3D parallax can be handled by doing a full 3D bundle adjustment, i.e., by replacing the 
projection Equation (8.59) used in Equation (8.62) with Equation (2.68), which models cam- 
era translations. The 3D positions of the matched feature points and cameras can then be si- 
multaneously recovered, although this can be significantly more expensive than parallax-free 
image registration. Once the 3D structure has been recovered, the scene could (in theory) be 
projected to a single (central) viewpoint that contains no parallax. However, to do this, dense 
stereo correspondence needs to be performed (Section 12.3) (Li, Shum et al. 2004; Zheng, 
Kang et al. 2007), which may not be possible if the images contain only partial overlap. In 
that case, it may be necessary to correct for parallax only in the overlap areas, which can be 
accomplished using a multi-perspective plane sweep (MPPS) algorithm (Kang, Szeliski, and 
Uyttendaele 2004; Uyttendaele, Criminisi et al. 2004). 

When the motion in the scene is very large, i.e., when objects appear and disappear com- 
pletely, a sensible solution is to simply select pixels from only one image at a time as the 
source for the final composite (Milgram 1977; Davis 1998; Agarwala, Dontcheva et al. 2004), 
as discussed in Section 8.4.2. However, when the motion is reasonably small (on the order of 
a few pixels), general 2D motion estimation (optical flow) can be used to perform an appro- 
priate correction before blending using a process called local alignment (Shum and Szeliski 
2000; Kang, Uyttendaele et al. 2003). This same process can also be used to compensate 
for radial distortion and 3D parallax, although it uses a weaker motion model than explic- 
itly modeling the source of error and may, therefore, fail more often or introduce unwanted 
distortions. 

Figure 8.10 Deghosting a mosaic with motion parallax (Shum and Szeliski 2000) © 2000
IEEE: (a) composite with parallax; (b) after a single deghosting step (patch size 32); (c) after
multiple steps (sizes 32, 16 and 8).

The local alignment technique introduced by Shum and Szeliski (2000) starts with the
global bundle adjustment (8.64) used to optimize the camera poses. Once these have been
estimated, the desired location of a 3D point $x_i$ can be estimated as the average of the
back-projected 3D locations,

$$\bar{x}_i \sim \sum_j c_{ij} \tilde{x}_i(\hat{x}_{ij}; R_j, f_j) \Big/ \sum_j c_{ij}, \tag{8.68}$$

which can be projected into each image j to obtain a target location $\bar{x}_{ij}$. The difference
between the target locations $\bar{x}_{ij}$ and the original features $x_{ij}$ provides a set of local motion
estimates

$$u_{ij} = \bar{x}_{ij} - x_{ij}, \tag{8.69}$$

which can be interpolated to form a dense correction field $u_j(x_j)$. In their system, Shum and
Szeliski (2000) use an inverse warping algorithm where the sparse $-u_{ij}$ values are placed
at the new target locations $\bar{x}_{ij}$, interpolated using bilinear kernel functions (Nielson 1993)
and then added to the original pixel coordinates when computing the warped (corrected)
image. To get a reasonably dense set of features to interpolate, Shum and Szeliski (2000)
place a feature point at the center of each patch (the patch size controls the smoothness in
the local alignment stage), rather than relying on features extracted using an interest operator
(Figure 8.10).
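The averaging and re-projection in Equations (8.68) and (8.69) can be sketched for a single feature as follows. This is an illustrative sketch only: it assumes unit-length back-projected rays and pixel coordinates measured from the image center, and the function names are hypothetical. The interpolation of the sparse flows into a dense field is not shown.

```python
import numpy as np

def back_project(x_obs, R, f):
    """Unit ray along R^T (x, y, f), proportional to R^{-1} K^{-1} (x, y, 1)."""
    ray = R.T @ np.array([x_obs[0], x_obs[1], f])
    return ray / np.linalg.norm(ray)

def project(ray, R, f):
    """Project a 3D ray into an image with rotation R and focal length f."""
    p = R @ ray
    return f * p[:2] / p[2]

def local_motion_estimates(obs, conf, rotations, focals):
    """Return target locations and residual flows u_ij for one feature i."""
    rays = np.array([back_project(x, R, f) for x, R, f in zip(obs, rotations, focals)])
    w = np.asarray(conf, dtype=float)
    mean_ray = (w[:, None] * rays).sum(axis=0) / w.sum()            # Eq. (8.68)
    targets = np.array([project(mean_ray, R, f) for R, f in zip(rotations, focals)])
    flows = targets - np.asarray(obs, dtype=float)                  # Eq. (8.69)
    return targets, flows

# One feature seen in two images whose estimated poses disagree by about 4 pixels.
obs = [(10.0, 0.0), (14.0, 0.0)]
rotations = [np.eye(3), np.eye(3)]
focals = [500.0, 500.0]
targets, flows = local_motion_estimates(obs, [1.0, 1.0], rotations, focals)
print(targets, flows)
```

With equal confidences the target lands roughly midway between the two observations, so each image is asked to move its feature by about two pixels in opposite directions.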

An alternative approach to motion-based deghosting was proposed by Kang, Uyttendaele 
et al. (2003), who estimate dense optical flow between each input image and a central refer- 
ence image. The accuracy of the flow vector is checked using a photo-consistency measure 
before a given warped pixel is considered valid and is used to compute a high dynamic range 
radiance estimate, which is the goal of their overall algorithm. The requirement for a ref- 
erence image makes their approach less applicable to general image mosaicing, although an 
extension to this case could certainly be envisaged. 

The idea of combining global parametric warps with local mesh-based warps or multiple
motion models to compensate for parallax has been refined in a number of more recent papers
(Zaragoza, Chin et al. 2013; Zhang and Liu 2014; Lin, Pankanti et al. 2015; Lin, Jiang et al.
2016; Herrmann, Wang et al. 2018b; Lee and Sim 2020). Some of these papers use content- 
preserving warps (Liu, Gleicher et al. 2009) for their local deformations, while others include 
a rolling shutter model (Zhuang and Tran 2020). 


8.3.3 Recognizing panoramas 


The final piece needed to perform fully automated image stitching is a technique to recognize 
which images actually go together, which Brown and Lowe (2007) call recognizing panoramas.
If the user takes images in sequence so that each image overlaps its predecessor and
also specifies the first and last images to be stitched, bundle adjustment combined with the 
process of topology inference can be used to automatically assemble a panorama (Sawhney 
and Kumar 1999). However, users often jump around when taking panoramas, e.g., they 
may start a new row on top of a previous one, jump back to take a repeat shot, or create 
360° panoramas where end-to-end overlaps need to be discovered. Furthermore, the ability 
to discover multiple panoramas taken by a user over an extended period of time can be a big 
convenience. 

To recognize panoramas, Brown and Lowe (2007) first find all pairwise image overlaps 
using a feature-based method and then find connected components in the overlap graph to 
“recognize” individual panoramas (Figure 8.11). The feature-based matching stage first ex- 
tracts scale invariant feature transform (SIFT) feature locations and feature descriptors (Lowe 
2004) from all the input images and places them in an indexing structure, as described in Sec- 
tion 7.1.3. For each image pair under consideration, the nearest matching neighbor is found 
for each feature in the first image, using the indexing structure to rapidly find candidates and 
then comparing feature descriptors to find the best match. RANSAC is used to find a set of in- 
lier matches; pairs of matches are used to hypothesize similarity motion models that are then 
used to count the number of inliers. A RANSAC algorithm tailored specifically for rotational 
panoramas is described by Brown, Hartley, and Nistér (2007). 
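The grouping step, finding connected components in the pairwise overlap graph, can be sketched with a small union-find structure. The inlier counts and the fixed `min_inliers` threshold below are hypothetical stand-ins for whatever match verification is actually used; they are assumptions made for illustration.

```python
def find_panoramas(num_images, matches, min_inliers=20):
    """Group images into connected components of the pairwise overlap graph.

    matches maps an image pair (a, b) to its RANSAC inlier count; pairs below
    the (hypothetical) min_inliers threshold are treated as non-overlapping.
    """
    parent = list(range(num_images))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for (a, b), inliers in matches.items():
        if inliers >= min_inliers:
            parent[find(a)] = find(b)       # merge the two components

    groups = {}
    for image in range(num_images):
        groups.setdefault(find(image), []).append(image)
    return sorted(groups.values())

matches = {(0, 1): 85, (1, 2): 40, (3, 4): 120, (2, 3): 6, (0, 2): 31}
panoramas = find_panoramas(6, matches)
print(panoramas)  # [[0, 1, 2], [3, 4], [5]]
```

The weak match between images 2 and 3 is rejected, so the six images split into two panoramas and one unmatched image.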

In practice, the most difficult part of getting a fully automated stitching algorithm to 
work is deciding which pairs of images actually correspond to the same parts of the scene. 
Repeated structures such as windows (Figure 8.12) can lead to false matches when using 
a feature-based approach. One way to mitigate this problem is to perform a direct pixel- 
based comparison between the registered images to determine if they actually are different 
views of the same scene. Unfortunately, this heuristic may fail if there are moving objects 
in the scene (Figure 8.13). While there is no magic bullet for this problem, short of full 
scene understanding, further improvements can likely be made by applying domain-specific
heuristics, such as priors on typical camera motions as well as machine learning techniques
applied to the problem of match validation.

Figure 8.11 Recognizing panoramas (Brown, Szeliski, and Winder 2005), figures courtesy
of Matthew Brown: (a) input images with pairwise matches; (b) images grouped into
connected components (panoramas); (c) individual panoramas registered and blended into
stitched composites.

Figure 8.12 Matching errors (Brown, Szeliski, and Winder 2004): accidental matching of
several features can lead to matches between pairs of images that do not actually overlap.

Figure 8.13 Validation of image matches by direct pixel error comparison can fail when
the scene contains moving objects (Uyttendaele, Eden, and Szeliski 2001) © 2001 IEEE.


8.4 Compositing 


Once we have registered all of the input images with respect to each other, we need to decide 
how to produce the final stitched mosaic image. This involves selecting a final compositing 
surface (flat, cylindrical, spherical, etc.) and view (reference image). It also involves selecting 
which pixels contribute to the final composite and how to optimally blend these pixels to 
minimize visible seams, blur, and ghosting. 

In this section, we review techniques that address these problems, namely compositing 
surface parameterization, pixel and seam selection, blending, and exposure compensation. 
Our emphasis is on fully automated approaches to the problem. Since the creation of high- 
quality panoramas and composites is as much an artistic endeavor as a computational one, 
various interactive tools have been developed to assist this process (Agarwala, Dontcheva 
et al. 2004; Li, Sun et al. 2004; Rother, Kolmogorov, and Blake 2004). Some of these are
covered in more detail in Section 10.4.


8.4.1 Choosing a compositing surface 


The first choice to be made is how to represent the final image. If only a few images are 
stitched together, a natural approach is to select one of the images as the reference and to 
then warp all of the other images into its reference coordinate system. The resulting com- 
posite is sometimes called a flat panorama, since the projection onto the final surface is still 
a perspective projection, and hence straight lines remain straight (which is often a desirable 
attribute).²³

For larger fields of view, however, we cannot maintain a flat representation without exces- 
sively stretching pixels near the border of the image. (In practice, flat panoramas start to look 
severely distorted once the field of view exceeds 90° or so.) The usual choice for compositing 
larger panoramas is to use a cylindrical (Chen 1995; Szeliski 1996) or spherical (Szeliski and 
Shum 1997) projection, as described in Section 8.2.6. In fact, any surface used for environ- 
ment mapping in computer graphics can be used, including a cube map, which represents 
the full viewing sphere with the six square faces of a cube (Greene 1986; Szeliski and Shum 
1997). Cartographers have also developed a number of alternative methods for representing 
the globe (Bugayevskiy and Snyder 1995). 


23 Techniques have also been developed to straighten curved lines in cylindrical and spherical panoramas (Carroll,
Agrawala, and Agarwala 2009; Kopf, Lischinski et al. 2009; Carroll, Agarwala, and Agrawala 2010).




The choice of parameterization is somewhat application-dependent and involves a trade- 
off between keeping the local appearance undistorted (e.g., keeping straight lines straight) 
and providing a reasonably uniform sampling of the environment. Automatically making 
this selection and smoothly transitioning between representations based on the extent of the 
panorama is discussed in Kopf, Uyttendaele et al. (2007). A recent trend in panoramic
photography has been the use of stereographic projections looking down at the ground (in an
outdoor scene) to create “little planet” renderings.²⁴


View selection. Once we have chosen the output parameterization, we still need to deter- 
mine which part of the scene will be centered in the final view. As mentioned above, for a flat 
composite, we can choose one of the images as a reference. Often, a reasonable choice is the 
one that is geometrically most central. For example, for rotational panoramas represented as 
a collection of 3D rotation matrices, we can choose the image whose z-axis is closest to the 
average z-axis (assuming a reasonable field of view). Alternatively, we can use the average 
z-axis (or quaternion, but this is trickier) to define the reference rotation matrix. 

For larger, e.g., cylindrical or spherical, panoramas, we can use the same heuristic if a 
subset of the viewing sphere has been imaged. In the case of full 360° panoramas, a better 
choice is to choose the middle image from the sequence of inputs, or sometimes the first 
image, assuming this contains the object of greatest interest. In all of these cases, having the 
user control the final view is often highly desirable. If the “up vector” computation described 
in Section 8.3.1 is working correctly, this can be as simple as panning over the image or 
setting a vertical “center line” for the final panorama. 


Coordinate transformations. After selecting the parameterization and reference view, we 
still need to compute the mappings between the input and output pixel coordinates.

If the final compositing surface is flat (e.g., a single plane or the face of a cube map) and 
the input images have no radial distortion, the coordinate transformation is the simple ho- 
mography described by Equation (8.38). This kind of warping can be performed in graphics 
hardware by appropriately setting texture mapping coordinates and rendering a single quadri- 
lateral. 

If the final composite surface has some other analytic form (e.g., cylindrical or spherical), 
we need to convert every pixel in the final panorama into a viewing ray (3D point) and then 
map it back into each image according to the projection (and optionally radial distortion)
equations. This process can be made more efficient by precomputing some lookup tables,
e.g., the partial trigonometric functions needed to map cylindrical or spherical coordinates to
3D coordinates or the radial distortion field at each pixel. It is also possible to accelerate this
process by computing exact pixel mappings on a coarser grid and then interpolating these
values.

24 These are inspired by The Little Prince by Antoine de Saint-Exupéry. Go to https://www.flickr.com and search
for “little planet projection”.
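For a spherical compositing surface, the pixel-to-ray-to-image mapping described above can be sketched as follows. The parameterization (longitude and latitude scaled by a constant `scale`, rays of the form (sin θ cos φ, sin φ, cos θ cos φ)), the centered pixel coordinates, and the function names are assumptions chosen for illustration; radial distortion and lookup tables are omitted.

```python
import math

def sphere_pixel_to_ray(u, v, scale, center):
    """Convert a spherical panorama pixel (u, v) to a unit viewing ray."""
    theta = (u - center[0]) / scale   # longitude
    phi = (v - center[1]) / scale     # latitude
    return (math.sin(theta) * math.cos(phi),
            math.sin(phi),
            math.cos(theta) * math.cos(phi))

def ray_to_image(ray, R, f, size):
    """Project a ray into an image with rotation R (3x3 nested lists) and focal length f.

    Pixel coordinates are measured from the image center. Returns None if the
    ray points behind the camera or falls outside the image of the given size.
    """
    x, y, z = (sum(R[r][c] * ray[c] for c in range(3)) for r in range(3))
    if z <= 0:
        return None
    px, py = f * x / z, f * y / z
    if abs(px) > size[0] / 2 or abs(py) > size[1] / 2:
        return None
    return px, py

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center, scale, f, size = (1000.0, 250.0), 500.0, 500.0, (640, 480)

for u in (1000.0, 1100.0, 1000.0 + scale * math.pi):
    ray = sphere_pixel_to_ray(u, 250.0, scale, center)
    print(u, ray_to_image(ray, identity, f, size))
```

The panorama center maps to the image center, a pixel 100 columns to the right maps to f·tan(0.2) in the source image, and a pixel on the opposite side of the sphere is rejected because its ray points behind the camera.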

When the final compositing surface is a texture-mapped polyhedron, a slightly more so- 
phisticated algorithm must be used. Not only do the 3D and texture map coordinates have to 
be properly handled, but a small amount of overdraw outside the triangle footprints in the tex- 
ture map is necessary, to ensure that the texture pixels being interpolated during 3D rendering 
have valid values (Szeliski and Shum 1997). 


Sampling issues. While the above computations can yield the correct (fractional) pixel ad- 
dresses in each input image, we still need to pay attention to sampling issues. For example, 
if the final panorama has a lower resolution than the input images, prefiltering the input im- 
ages is necessary to avoid aliasing. These issues have been extensively studied in both the 
image processing and computer graphics communities. The basic problem is to compute the 
appropriate prefilter, which depends on the distance (and arrangement) between neighboring 
samples in a source image. As discussed in Sections 3.5.2 and 3.6.1, various approximate 
solutions, such as MIP mapping (Williams 1983) or elliptically weighted Gaussian averaging 
(Greene and Heckbert 1986) have been developed in the graphics community. For highest 
visual quality, a higher order (e.g., cubic) interpolator combined with a spatially adaptive pre- 
filter may be necessary (Wang, Kang et al. 2001). Under certain conditions, it may also be 
possible to produce images with a higher resolution than the input images using the process
of super-resolution (Section 10.3).


8.4.2 Pixel selection and weighting (deghosting) 


Once the source pixels have been mapped onto the final composite surface, we must still 
decide how to blend them in order to create an attractive-looking panorama. If all of the 
images are in perfect registration and identically exposed, this is an easy problem, i.e., any 
pixel or combination will do. However, for real images, visible seams (due to exposure 
differences), blurring (due to misregistration), or ghosting (due to moving objects) can occur. 

Creating clean, pleasing-looking panoramas involves both deciding which pixels to use 
and how to weight or blend them. The distinction between these two stages is a little fluid, 
since per-pixel weighting can be thought of as a combination of selection and blending. In 
this section, we discuss spatially varying weighting, pixel selection (seam placement), and
then more sophisticated blending.

Figure 8.14 Final composites computed by a variety of algorithms (Szeliski 2006a): (a)
average, (b) median, (c) feathered average, (d) p-norm p = 10, (e) Voronoi, (f) weighted
ROD vertex cover with feathering, (g) graph cut seams with Poisson blending, and (h) with
pyramid blending.

Feathering and center-weighting. The simplest way to create a final composite is to simply
take an average value at each pixel,

$$C(x) = \sum_k w_k(x) \tilde{I}_k(x) \Big/ \sum_k w_k(x), \tag{8.70}$$

where $\tilde{I}_k(x)$ are the warped (re-sampled) images and $w_k(x)$ is 1 at valid pixels and 0 elsewhere.
On computer graphics hardware, this kind of summation can be performed in an
accumulation buffer (using the A channel as the weight).

Simple averaging usually does not work very well, since exposure differences, misregistrations,
and scene movement are all very visible (Figure 8.14a). If rapidly moving objects
are the only problem, a median filter (which is a kind of pixel selection operator) can
often be used to remove them (Figure 8.14b) (Irani and Anandan 1998). Conversely,
center-weighting (discussed below) and minimum likelihood selection (Agarwala, Dontcheva et al.
2004) can sometimes be used to retain multiple copies of a moving object (Figure 8.17).

A better approach to averaging is to weight pixels near the center of the image more 
heavily and to down-weight pixels near the edges. When an image has some cutout regions, 
down-weighting pixels near the edges of both cutouts and the image is preferable. This can
be done by computing a distance map or grassfire transform,

$$w_k(x) = \Big\| \arg\min_y \{ \|y\| \;|\; \tilde{I}_k(x + y) \text{ is invalid} \} \Big\|, \tag{8.71}$$


where each valid pixel is tagged with its Euclidean distance to the nearest invalid pixel (Sec- 
tion 3.3.3). The Euclidean distance map can be efficiently computed using a two-pass raster 
algorithm (Danielsson 1980; Borgefors 1986). 

Weighted averaging with a distance map is often called feathering (Szeliski and Shum 
1997; Chen and Klette 1999; Uyttendaele, Eden, and Szeliski 2001) and does a reasonable job 
of blending over exposure differences. However, blurring and ghosting can still be problems 
(Figure 8.14c). Note that weighted averaging is not the same as compositing the individual 
images with the classic over operation (Porter and Duff 1984; Blinn 1994a), even when using 
the weight values (normalized to sum up to one) as alpha (translucency) channels. This is 
because the over operation attenuates the values from more distant surfaces and, hence, is not 
equivalent to a direct sum. 
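A minimal sketch of feathering with distance-map weights, combining Equations (8.70) and (8.71), is given below. For brevity it uses a brute-force distance computation rather than the two-pass raster algorithm cited above, and it treats pixels outside the image as invalid; both choices, along with the function names, are assumptions for illustration.

```python
import numpy as np

def distance_map(valid):
    """Euclidean distance from each valid pixel to the nearest invalid pixel.

    Brute force, O(N^2), so suitable only for a small demo. Pixels outside the
    image are treated as invalid, so weights fall off towards the image border.
    """
    padded = np.pad(np.asarray(valid, dtype=bool), 1, constant_values=False)
    inv_y, inv_x = np.nonzero(~padded)
    d = np.zeros(padded.shape)
    for y, x in zip(*np.nonzero(padded)):
        d[y, x] = np.sqrt(((inv_y - y) ** 2 + (inv_x - x) ** 2).min())
    return d[1:-1, 1:-1]

def feather(images, masks):
    """Weighted average C(x) of Eq. (8.70) using distance-map weights w_k(x)."""
    weights = np.array([distance_map(m) for m in masks])
    total = weights.sum(axis=0)
    blended = (weights * np.array(images, dtype=float)).sum(axis=0)
    return np.divide(blended, total, out=np.zeros_like(blended), where=total > 0)

# Two differently exposed 5 x 8 images that overlap in columns 2 to 5.
mask_a = np.zeros((5, 8), dtype=bool)
mask_a[:, :6] = True
mask_b = np.zeros((5, 8), dtype=bool)
mask_b[:, 2:] = True
images = [np.full((5, 8), 100.0), np.full((5, 8), 200.0)]

C = feather(images, [mask_a, mask_b])
print(np.round(C[2], 1))
```

Along the middle row, the composite ramps from the first exposure to the second across the overlap instead of switching abruptly.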

One way to improve feathering is to raise the distance map values to some large power,
i.e., to use $w_k^p(x)$ in Equation (8.70). The weighted averages then become dominated by
the larger values, i.e., they act somewhat like a p-norm. The resulting composite can often
provide a reasonable tradeoff between visible exposure differences and blur (Figure 8.14d).

Figure 8.15 Computation of regions of difference (RODs) (Uyttendaele, Eden, and Szeliski
2001) © 2001 IEEE: (a) three overlapping images with a moving face; (b) corresponding
RODs; (c) graph of coincident RODs.


In the limit as $p \to \infty$, only the pixel with the maximum weight is selected. This hard
pixel selection process produces a visibility mask-sensitive variant of the familiar Voronoi
diagram, which assigns each pixel to the nearest image center in the set (Wood, Finkelstein
et al. 1997; Peleg, Rousso et al. 2000). The resulting composite, while useful for artistic
guidance and in high-overlap panoramas (manifold mosaics), tends to have very hard edges
with noticeable seams when the exposures vary (Figure 8.14e).
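The effect of raising the weights to a power p, and the hard selection obtained in the limit, can be illustrated on one row of an overlap region. The weights and pixel values below are made-up toy numbers, and the function names are hypothetical.

```python
import numpy as np

def p_norm_blend(images, weights, p):
    """Blend with weights raised to the power p, i.e., w_k^p in Eq. (8.70)."""
    w = np.asarray(weights, dtype=float) ** p
    return (w * np.asarray(images, dtype=float)).sum(axis=0) / w.sum(axis=0)

def hard_select(images, weights):
    """Limit p -> infinity: take the image with the largest weight at each pixel."""
    images = np.asarray(images, dtype=float)
    labels = np.argmax(weights, axis=0)
    return np.take_along_axis(images, labels[None], axis=0)[0]

# One row of an overlap region: image 0 fades out while image 1 fades in.
images = [np.full(4, 100.0), np.full(4, 200.0)]
weights = [np.array([4.0, 3.0, 2.0, 1.0]), np.array([1.0, 2.0, 3.0, 4.0])]

for p in (1, 10):
    print(p, np.round(p_norm_blend(images, weights, p), 1))
print("hard", hard_select(images, weights))
```

With p = 1 the row is a smooth ramp between the two exposures; with p = 10 it is already close to the hard Voronoi-like selection, which keeps each image's own values up to the seam.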

Xiong and Turkowski (1998) use this Voronoi idea (local maximum of the grassfire trans- 
form) to select seams for Laplacian pyramid blending (which is discussed below). However, 
since the seam selection is performed sequentially as new images are added in, some artifacts
can occur.


Optimal seam selection. Computing the Voronoi diagram is one way to select the seams 
between regions where different images contribute to the final composite. However, Voronoi 
images totally ignore the local image structure underlying the seam. A better approach is 
to place the seams in regions where the images agree, so that transitions from one source to 
another are not visible. In this way, the algorithm avoids “cutting through” moving objects 
where a seam would look unnatural (Davis 1998). For a pair of images, this process can be 
formulated as a simple dynamic program starting from one edge of the overlap region and 
ending at the other (Milgram 1975, 1977; Davis 1998; Efros and Freeman 2001). 
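For a pair of images, the dynamic program can be sketched as a minimum-cost path through a per-pixel disagreement map of the overlap region. The cost definition (for example, a squared color difference), the top-to-bottom seam direction, and the one-column step limit are assumptions of this sketch rather than details taken from the cited papers.

```python
def min_cost_seam(cost):
    """Return one column index per row tracing the cheapest top-to-bottom seam.

    cost[r][c] is the disagreement between the two images at overlap pixel
    (r, c); the seam may shift by at most one column between consecutive rows.
    """
    rows, cols = len(cost), len(cost[0])
    total = [list(cost[0])]                      # cumulative cost table
    for r in range(1, rows):
        prev = total[-1]
        total.append([
            cost[r][c] + min(prev[max(c - 1, 0):min(c + 2, cols)])
            for c in range(cols)
        ])

    # Backtrack from the cheapest cell in the last row.
    c = min(range(cols), key=total[-1].__getitem__)
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 2, cols)
        c = min(range(lo, hi), key=total[r - 1].__getitem__)
        seam.append(c)
    return seam[::-1]

cost = [
    [9, 1, 9, 9],
    [9, 9, 1, 9],
    [9, 9, 9, 0],
    [9, 9, 1, 9],
]
print(min_cost_seam(cost))  # [1, 2, 3, 2]
```

The seam follows the low-cost cells, i.e., the places where the two images agree, which is exactly where a transition is least visible.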

When multiple images are being composited, the dynamic program idea does not readily 
generalize. (For square texture tiles being composited sequentially, Efros and Freeman (2001) 
run a dynamic program along each of the four tile sides.) 

Figure 8.16 Photomontage (Agarwala, Dontcheva et al. 2004) © 2004 ACM. From a set
of five source images (of which four are shown on the left), Photomontage quickly creates
a composite family portrait in which everyone is smiling and looking at the camera (right).
Users simply flip through the stack and coarsely draw strokes using the designated source
image objective over the people they wish to add to the composite. The user-applied strokes
and computed regions (middle) are color-coded by the borders of the source images on the
left.

To overcome this problem, Uyttendaele, Eden, and Szeliski (2001) observed that, for
well-registered images, moving objects produce the most visible artifacts, namely
translucent-looking ghosts. Their system therefore decides which objects to keep and which ones
to erase. First, the algorithm compares all overlapping input image pairs to determine
regions of difference (RODs) where the images disagree. Next, a graph is constructed with the
RODs as vertices and edges representing ROD pairs that overlap in the final composite
(Figure 8.15). Since the presence of an edge indicates an area of disagreement, vertices (regions)
must be removed from the final composite until no edge spans a pair of remaining vertices. 
The smallest such set can be computed using a vertex cover algorithm. Since several such 
covers may exist, a weighted vertex cover is used instead, where the vertex weights are com- 
puted by summing the feather weights in the ROD (Uyttendaele, Eden, and Szeliski 2001). 
The algorithm therefore prefers removing regions that are near the edge of the image, which 
reduces the likelihood that partially visible objects will appear in the final composite. (It is 
also possible to infer which object in a region of difference is the foreground object by the 
“edginess” (pixel differences) across the ROD boundary, which should be higher when an 
object is present (Herley 2005).) Once the desired excess regions of difference have been 
removed, the final composite can be created by feathering (Figure 8.14f). 
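The ROD removal step can be illustrated with a tiny weighted vertex cover. The sketch below simply enumerates all subsets, which is only practical for a handful of RODs and is not claimed to be the solver used by Uyttendaele, Eden, and Szeliski (2001); the ROD names and feather-weight sums are made up for the example.

```python
from itertools import combinations

def min_weight_vertex_cover(weights, edges):
    """Exhaustively find the cheapest set of vertices that touches every edge.

    weights maps each ROD to its summed feather weight; edges lists pairs of
    RODs that overlap in the composite. Exponential time: small graphs only.
    """
    vertices = sorted(weights)
    best, best_cost = None, float("inf")
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if all(a in chosen or b in chosen for a, b in edges):
                cost = sum(weights[v] for v in chosen)
                if cost < best_cost:
                    best, best_cost = chosen, cost
    return best, best_cost

# Hypothetical RODs: low weights correspond to regions near image borders.
weights = {"A": 5.0, "B": 1.0, "C": 4.0, "D": 1.5}
edges = [("A", "B"), ("B", "C"), ("C", "D")]
removed, cost = min_weight_vertex_cover(weights, edges)
print(sorted(removed), cost)  # ['B', 'D'] 2.5
```

The cover removes the two low-weight RODs, which matches the stated preference for discarding regions near image edges while keeping the centrally located ones.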

A different approach to pixel selection and seam placement is described by Agarwala,
Dontcheva et al. (2004). Their system computes the label assignment that optimizes the sum
of two objective functions. The first is a per-pixel image objective that determines which
pixels are likely to produce good composites,

$$E_D = \sum_x D(x, l(x)), \tag{8.72}$$

where $D(x, l)$ is the data penalty associated with choosing image l at pixel x. In their system,
users can select which pixels to use by “painting” over an image with the desired object or
appearance, which sets $D(x, l)$ to a large value for all labels l other than the one selected
by the user (Figure 8.16). Alternatively, automated selection criteria can be used, such as
maximum likelihood, which prefers pixels that occur repeatedly in the background (for object
removal), or minimum likelihood for objects that occur infrequently, i.e., for moving object
retention. Using a more traditional center-weighted data term tends to favor objects that are
centered in the input images (Figure 8.17).

Figure 8.17 Set of five photos tracking a snowboarder's jump stitched together into a
seamless composite. Because the algorithm prefers pixels near the center of the image,
multiple copies of the boarder are retained.

The second term is a seam objective that penalizes differences in labels between adjacent
pixels,

$$E_S = \sum_{(x,y) \in \mathcal{N}} S(x, y, l(x), l(y)), \tag{8.73}$$

where $S(x, y, l_x, l_y)$ is the image-dependent interaction penalty or seam cost of placing a
seam between pixels x and y, and $\mathcal{N}$ is the set of $\mathcal{N}_4$ neighboring pixels. For example,
the simple color-based seam penalty used in Kwatra, Schödl et al. (2003) and Agarwala,
Dontcheva et al. (2004) can be written as

$$S(x, y, l_x, l_y) = \| \tilde{I}_{l_x}(x) - \tilde{I}_{l_y}(x) \| + \| \tilde{I}_{l_x}(y) - \tilde{I}_{l_y}(y) \|. \tag{8.74}$$


More sophisticated seam penalties can also look at image gradients or the presence of image 
edges (Agarwala, Dontcheva et al. 2004). Seam penalties are widely used in other computer 
vision applications such as stereo matching (Boykov, Veksler, and Zabih 2001) to give the 
labeling function its coherence or smoothness. An alternative approach, which places seams 
along strong consistent edges in overlapping images using a watershed computation is de- 
scribed by Soille (2006). 
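To see how the data term (8.72) and the color-based seam penalty (8.73) and (8.74) combine, the following sketch evaluates the total energy of a given labeling for grayscale images. It only scores a labeling; it does not perform the optimization, and the array layout and function name are assumptions made for the example.

```python
import numpy as np

def labeling_energy(images, data_cost, labels):
    """Return (E_D, E_S) for a label image using the seam penalty of Eq. (8.74).

    images: (L, H, W) warped grayscale images; data_cost: (L, H, W) penalties
    D(x, l); labels: (H, W) integer array giving the image chosen at each pixel.
    """
    images = np.asarray(images, dtype=float)
    data_cost = np.asarray(data_cost, dtype=float)
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    e_data = data_cost[labels, rows, cols].sum()             # Eq. (8.72)

    e_seam = 0.0
    for dy, dx in ((0, 1), (1, 0)):                          # each N4 pair once
        rx, cx = rows[:h - dy, :w - dx], cols[:h - dy, :w - dx]
        ry, cy = rx + dy, cx + dx                            # neighbor pixel y
        lx, ly = labels[rx, cx], labels[ry, cy]
        at_x = np.abs(images[lx, rx, cx] - images[ly, rx, cx])
        at_y = np.abs(images[lx, ry, cy] - images[ly, ry, cy])
        e_seam += (at_x + at_y).sum()                        # Eq. (8.73)
    return e_data, e_seam

# The two images agree in columns 0-1 and disagree in columns 2-3.
img0 = np.full((2, 4), 10.0)
img1 = np.array([[10.0, 10.0, 50.0, 50.0], [10.0, 10.0, 50.0, 50.0]])
zero_cost = np.zeros((2, 2, 4))

seam_in_disagreement = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
seam_in_agreement = np.array([[0, 1, 1, 1], [0, 1, 1, 1]])
print(labeling_energy([img0, img1], zero_cost, seam_in_disagreement))
print(labeling_energy([img0, img1], zero_cost, seam_in_agreement))
```

A seam placed where the images agree costs nothing, whereas a seam that borders the region of disagreement is penalized, which is why the optimizer tends to route seams through consistent areas.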
The sum of these two objective functions gives rise to a Markov random field (MRF),
for which good optimization algorithms are described in Sections 4.3 and 4.3.2 and
Appendix B.5. For label computations of this kind, the α-expansion algorithm developed by
Boykov, Veksler, and Zabih (2001) works particularly well (Szeliski, Zabih et al. 2008).

For the result shown in Figure 8.14g, Agarwala, Dontcheva et al. (2004) use a large data
penalty for invalid pixels and 0 for valid pixels. Notice how the seam placement algorithm
avoids regions of difference, including those that border the image and that might result in 
objects being cut off. Graph cuts (Agarwala, Dontcheva et al. 2004) and vertex cover (Uytten- 
daele, Eden, and Szeliski 2001) often produce similar looking results, although the former is 
significantly slower since it optimizes over all pixels, while the latter is more sensitive to the 
thresholds used to determine regions of difference. More recent approaches to seam selection 
include SEAGULL (Lin, Jiang et al. 2016), which jointly optimizes local alignment and seam 
selection, and object-centered image stitching (Herrmann, Wang et al. 2018a), which uses an 
off-the-shelf object detector to avoid cutting through objects. 


8.4.3 Application: Photomontage 


While image stitching is normally used to composite partially overlapping photographs, it 
can also be used to composite repeated shots of a scene taken with the aim of obtaining the 
best possible composition and appearance of each element. 

Figure 8.16 shows the Photomontage system developed by Agarwala, Dontcheva et al. 
(2004), where users draw strokes over a set of pre-aligned images to indicate which regions 
they wish to keep from each image. Once the system solves the resulting multi-label graph 
cut (8.72-8.73), the various pieces taken from each source photo are blended together using 
a variant of Poisson image blending (8.75-8.77). Their system can also be used to auto- 
matically composite an all-focus image from a series of bracketed focus images (Hasinoff, 
Kutulakos et al. 2009) or to remove wires and other unwanted elements from sets of
photographs. Exercise 8.14 has you implement this system and try out some of its variants.


8.4.4 Blending 


Once the seams between images have been determined and unwanted objects removed, we 
still need to blend the images to compensate for exposure differences and other misalign- 
ments. The spatially varying weighting (feathering) previously discussed can often be used 
to accomplish this. However, it is difficult in practice to achieve a pleasing balance between 
smoothing out low-frequency exposure variations and retaining sharp enough transitions to
prevent blurring (although using a high exponent in feathering can help).

Figure 8.18 Poisson image editing (Pérez, Gangnet, and Blake 2003) © 2003 ACM: (a)
The dog and the two children are chosen as source images to be pasted into the destination
swimming pool. (b) Simple pasting fails to match the colors at the boundaries, whereas (c)
Poisson image blending masks these differences.


Laplacian pyramid blending. An attractive solution to this problem is the Laplacian pyra- 
mid blending technique developed by Burt and Adelson (1983b), which we discussed in Sec- 
tion 3.5.5. Instead of using a single transition width, a frequency-adaptive width is used by 
creating a band-pass (Laplacian) pyramid and making the transition widths within each level 
a function of the level, i.e., the same width in pixels. In practice, a small number of levels, 
i.e., as few as two (Brown and Lowe 2007), may be adequate to compensate for differences 
in exposure. The result of applying this pyramid blending is shown in Figure 8.14h. 
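To make the idea of a frequency-dependent transition width concrete, the following sketch implements a simplified two-band blend in the spirit of the two-level scheme mentioned above. It is an illustration under stated assumptions rather than a full Laplacian pyramid: it uses a box filter in place of a Gaussian, it does not downsample, it handles grayscale images only, and the names `box_blur` and `two_band_blend` are hypothetical.

```python
import numpy as np

def box_blur(img, radius):
    """Separable box filter with edge replication (a stand-in for a Gaussian)."""
    k = 2 * radius + 1
    out = np.asarray(img, dtype=float)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        csum = np.cumsum(padded, axis=axis)
        zero = np.zeros_like(np.take(csum, [0], axis=axis))
        csum = np.concatenate([zero, csum], axis=axis)
        n = out.shape[axis]
        # Sliding-window sums from differences of the running sum.
        hi = np.take(csum, np.arange(k, k + n), axis=axis)
        lo = np.take(csum, np.arange(0, n), axis=axis)
        out = (hi - lo) / k
    return out

def two_band_blend(img_a, img_b, mask_a, radius):
    """Blend two aligned grayscale images; mask_a is True where img_a is preferred."""
    img_a = np.asarray(img_a, dtype=float)
    img_b = np.asarray(img_b, dtype=float)
    low_a, low_b = box_blur(img_a, radius), box_blur(img_b, radius)
    high_a, high_b = img_a - low_a, img_b - low_b
    # Low frequencies: wide, smooth transition (blurred mask).
    soft = box_blur(mask_a.astype(float), radius)
    low = soft * low_a + (1.0 - soft) * low_b
    # High frequencies: abrupt transition, so fine detail is not ghosted.
    high = np.where(mask_a, high_a, high_b)
    return low + high
```

Blending an image with itself returns the image unchanged, which is a quick sanity check that the two bands sum back to the original signal.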


Gradient domain blending. An alternative approach to multi-band image blending is to 
perform the operations in the gradient domain. Reconstructing images from their gradient 
fields has a long history in computer vision (Horn 1986), starting originally with work in 
brightness constancy (Horn 1974), shape from shading (Horn and Brooks 1989), and photo- 
metric stereo (Woodham 1981). Related ideas have also been used for reconstructing images 
from their edges (Elder and Goldberg 2001), removing shadows from images (Weiss 2001), 
separating reflections from a single image (Levin, Zomet, and Weiss 2004; Levin and Weiss 
2007), and tone mapping high dynamic range images by reducing the magnitude of image 
edges (gradients) (Fattal, Lischinski, and Werman 2002). 

Pérez, Gangnet, and Blake (2003) show how gradient domain reconstruction can be used 
to do seamless object insertion in image editing applications (Figure 8.18). Rather than copy- 
ing pixels, the gradients of the new image fragment are copied instead. The actual pixel values 
for the copied area are then computed by solving a Poisson equation that locally matches the
gradients while obeying the fixed Dirichlet (exact matching) conditions at the seam
boundary. Pérez, Gangnet, and Blake (2003) show that this is equivalent to computing an additive
membrane interpolant of the mismatch between the source and destination images along the
boundary.²⁵ In earlier work, Peleg (1981) also proposed adding a smooth function to enforce
consistency along the seam curve.

Agarwala, Dontcheva et al. (2004) extended this idea to a multi-source formulation, where 
it no longer makes sense to talk of a destination image whose exact pixel values must be 
matched at the seam. Instead, each source image contributes its own gradient field and the 
Poisson equation is solved using Neumann boundary conditions, i.e., dropping any equations
that involve pixels outside the boundary of the image. 

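Rather than solving the Poisson partial differential equations, Agarwala, Dontcheva et al.
(2004) directly minimize a variational problem,

$$\min_{C(\mathbf{x})} \| \nabla C(\mathbf{x}) - \nabla \tilde{I}_{l(\mathbf{x})}(\mathbf{x}) \|^2. \tag{8.75}$$

The discretized form of this equation is a set of gradient constraint equations

$$C(\mathbf{x} + \hat{\imath}) - C(\mathbf{x}) = \tilde{I}_{l(\mathbf{x})}(\mathbf{x} + \hat{\imath}) - \tilde{I}_{l(\mathbf{x})}(\mathbf{x}) \quad \text{and} \tag{8.76}$$

$$C(\mathbf{x} + \hat{\jmath}) - C(\mathbf{x}) = \tilde{I}_{l(\mathbf{x})}(\mathbf{x} + \hat{\jmath}) - \tilde{I}_{l(\mathbf{x})}(\mathbf{x}), \tag{8.77}$$

where $\hat{\imath} = (1, 0)$ and $\hat{\jmath} = (0, 1)$ are unit vectors in the x and y directions.²⁶ They then solve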
the associated sparse least squares problem. Since this system of equations is only defined 
up to an additive constant, Agarwala, Dontcheva et al. (2004) ask the user to select the
value of one pixel. In practice, a better choice might be to weakly bias the solution towards 
reproducing the original color values. 
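A one-dimensional version of this least squares problem is small enough to write out directly. The sketch below is illustrative only and makes several assumptions: it works on 1D signals, averages the two source gradients at seam locations, resolves the unknown additive offset with a weak bias towards the copied pixel values (the alternative suggested above, not the user-selected pixel), and uses a dense solver. The name `blend_gradients_1d` and the parameter `bias_weight` are hypothetical.

```python
import numpy as np

def blend_gradients_1d(sources, labels, bias_weight=1e-4):
    """Reconstruct a 1D composite C from the gradients of the selected sources.

    sources: array (K, N) of aligned 1D signals; labels: length-N integer array.
    """
    sources = np.asarray(sources, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = labels.size
    A = np.zeros((n - 1 + n, n))
    b = np.zeros(n - 1 + n)
    # Gradient constraints C(x+1) - C(x) = source gradient, as in (8.76).
    for x in range(n - 1):
        la, lb = labels[x], labels[x + 1]
        grad_a = sources[la, x + 1] - sources[la, x]
        grad_b = sources[lb, x + 1] - sources[lb, x]
        A[x, x], A[x, x + 1] = -1.0, 1.0
        b[x] = 0.5 * (grad_a + grad_b)   # equals the source gradient when la == lb
    # Weak bias towards the copied values fixes the free additive offset.
    copied = sources[labels, np.arange(n)]
    w = np.sqrt(bias_weight)
    A[n - 1:, :] = w * np.eye(n)
    b[n - 1:] = w * copied
    C, *_ = np.linalg.lstsq(A, b, rcond=None)
    return C
```

For two ramps that differ by a constant exposure offset, the copied signal jumps at the seam, whereas the reconstructed composite keeps a nearly constant slope across it. Real images would call for a sparse solver such as the preconditioned conjugate gradient methods discussed next.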

In order to accelerate the solution of this sparse linear system, Fattal, Lischinski, and Wer- 
man (2002) use multigrid, whereas Agarwala, Dontcheva et al. (2004) use hierarchical basis 
preconditioned conjugate gradient descent (Szeliski 1990b, 2006b; Krishnan and Szeliski 
2011; Krishnan, Fattal, and Szeliski 2013) (Appendix A.5). In subsequent work, Agarwala 
(2007) shows how using a quadtree representation for the solution can further accelerate the 
computation with minimal loss in accuracy, while Szeliski, Uyttendaele, and Steedly (2008) 
show how representing the per-image offset fields using coarser splines is even faster. This 
latter work also argues that blending in the log domain, i.e., using multiplicative rather than 
additive offsets, is preferable, as it more closely matches texture contrasts across seam
boundaries. The resulting seam blending works very well in practice (Figure 8.14h), although care
must be taken when copying large gradient values near seams so that a “double edge” is not
introduced.

²⁵ The membrane interpolant is known to have nicer interpolation properties for arbitrary-shaped constraints than
frequency-domain interpolants (Nielson 1993).
²⁶ At seam locations, the right-hand side is replaced by the average of the gradients in the two source images.

Copying gradients directly from the source images after seam placement is just one
approach to gradient domain blending. The paper by Levin, Zomet et al. (2004) examines
several different variants of this approach, which they call Gradient-domain Image STitching
(GIST). The techniques they examine include feathering (blending) the gradients from the
source images, as well as using an $L_1$ norm in performing the reconstruction of the image
from the gradient field, rather than using an $L_2$ norm as in Equation (8.75). Their preferred
technique is the $L_1$ optimization of a feathered (blended) cost function on the original image
gradients (which they call GIST1-$l_1$). Since $L_1$ optimization using linear programming can
be slow, they develop a faster iterative median-based algorithm in a multigrid framework. 
Visual comparisons between their preferred approach and what they call optimal seam on the 
gradients (which is equivalent to the approach of Agarwala, Dontcheva et al. (2004)) show 
similar results, while significantly improving on pyramid blending and feathering algorithms. 


Exposure compensation. Pyramid and gradient domain blending can do a good job of 
compensating for moderate amounts of exposure differences between images. However, 
when the exposure differences become large, alternative approaches may be necessary. 

Uyttendaele, Eden, and Szeliski (2001) iteratively estimate a local correction between 
each source image and a blended composite. First, a block-based quadratic transfer function is 
fit between each source image and an initial feathered composite. Next, transfer functions are 
averaged with their neighbors to get a smoother mapping and per-pixel transfer functions are 
computed by splining (interpolating) between neighboring block values. Once each source 
image has been smoothly adjusted, a new feathered composite is computed and the process is 
repeated (typically three times). The results shown by Uyttendaele, Eden, and Szeliski (2001) 
demonstrate that this does a better job of exposure compensation than simple feathering and 
can handle local variations in exposure due to effects such as lens vignetting. 

Ultimately, however, the most principled way to deal with exposure differences is to stitch 
images in the radiance domain, i.e., to convert each image into a radiance image using its 
exposure value and then create a stitched, high dynamic range image, as discussed in Sec- 
tion 10.2 and Eden, Uyttendaele, and Szeliski (2006). 


8.5 Additional reading 


Hartley and Zisserman (2004) provide a wonderful introduction to the topics of feature-based
alignment and optimal motion estimation. Techniques for robust estimation are discussed
in more detail in Appendix B.3 and in monographs and review articles on this topic (Huber
1981; Hampel, Ronchetti et al. 1986; Rousseeuw and Leroy 1987; Black and Rangarajan 
1996; Stewart 1999). The most commonly used robust initialization technique in computer 
vision is RANdom SAmple Consensus (RANSAC) (Fischler and Bolles 1981), which has 
spawned a series of more efficient variants (Torr and Zisserman 2000; Nistér 2003; Chum and 
Matas 2005; Raguram, Chum et al. 2012; Brachmann, Krull et al. 2017; Barath and Matas 
2018; Barath, Matas, and Noskova 2019; Brachmann and Rother 2019). The MAGSAC++ 
paper by Barath, Noskova et al. (2020) compares many of these variants. 


The literature on image stitching dates back to work in the photogrammetry community in 
the 1970s (Milgram 1975, 1977; Slama 1980). In computer vision, papers started appearing 
in the early 1980s (Peleg 1981), while the development of fully automated techniques came 
about a decade later (Mann and Picard 1994; Chen 1995; Szeliski 1996; Szeliski and Shum 
1997; Sawhney and Kumar 1999; Shum and Szeliski 2000). Those techniques used direct 
pixel-based alignment but feature-based approaches are now the norm (Zoghlami, Faugeras, 
and Deriche 1997; Capel and Zisserman 1998; Cham and Cipolla 1998; Badra, Qumsieh, and 
Dudek 1998; McLauchlan and Jaenicke 2002; Brown and Lowe 2007). A collection of some 
of these papers can be found in the book by Benosman and Kang (2001). Szeliski (2006a) 
provides a comprehensive survey of image stitching, on which the material in this chapter is 
based. More recent publications include Zaragoza, Chin et al. (2013), Zhang and Liu (2014), 
Lin, Pankanti et al. (2015), Lin, Jiang et al. (2016), Herrmann, Wang et al. (2018b), Lee and 
Sim (2020), and Zhuang and Tran (2020). 


High-quality techniques for optimal seam finding and blending are another important 
component of image stitching systems. Important developments in this field include work by 
Milgram (1977), Burt and Adelson (1983b), Davis (1998), Uyttendaele, Eden, and Szeliski 
(2001), Pérez, Gangnet, and Blake (2003), Levin, Zomet et al. (2004), Agarwala, Dontcheva 
et al. (2004), Eden, Uyttendaele, and Szeliski (2006), Kopf, Uyttendaele et al. (2007), Lin, 
Jiang et al. (2016), and Herrmann, Wang et al. (2018a). 


In addition to the merging of multiple overlapping photographs taken for aerial or ter- 
restrial panoramic image creation, stitching techniques can be used for automated white- 
board scanning (He and Zhang 2005; Zhang and He 2007), scanning with a mouse (Nakao, 
Kashitani, and Kaneyoshi 1998), and retinal image mosaics (Can, Stewart et al. 2002). They
can also be applied to video sequences (Teodosio and Bender 1993; Irani, Hsu, and Anandan 
1995; Kumar, Anandan et al. 1995; Sawhney and Ayer 1996; Massey and Bender 1996; Irani 
and Anandan 1998; Sawhney, Arpa et al. 2002; Agarwala, Zheng et al. 2005; Rav-Acha, 
Pritch et al. 2005; Steedly, Pal, and Szeliski 2005; Baudisch, Tan et al. 2006) and can even 
be used for video compression (Lee, Chen et al. 1997).


8.6 Exercises


Ex 8.1: Feature-based image alignment for flip-book animations. Take a set of photos
of an action scene or portrait (preferably in burst shooting mode) and align them to make a
composite or flip-book animation.

1. Extract features and feature descriptors using some of the techniques described in
Sections 7.1.1-7.1.2.

2. Match your features using nearest neighbor matching with a nearest neighbor distance
ratio test (7.18).

3. Compute an optimal 2D translation and rotation between the first image and all
subsequent images, using least squares (Section 8.1.1) with optional RANSAC for robustness
(Section 8.1.4).

4. Resample all of the images onto the first image’s coordinate frame (Section 3.6.1) using
either bilinear or bicubic resampling and optionally crop them to their common area.

5. Convert the resulting images into an animated GIF (using software available from the
web) or optionally implement cross-dissolves to turn them into a “slo-mo” video.

6. (Optional) Combine this technique with feature-based (Exercise 3.25) morphing.


Ex 8.2: Panography. Create the kind of panograph discussed in Section 8.1.2 and
commonly found on the web.

1. Take a series of interesting overlapping photos.

2. Use the feature detector, descriptor, and matcher developed in Exercises 7.1-7.4 (or
existing software) to match features among the images.

3. Turn each connected component of matching features into a track, i.e., assign a unique
index i to each track, discarding any tracks that are inconsistent (contain two different
features in the same image).

4. Compute a global translation for each image using Equation (8.12).

5. Since your matches probably contain errors, turn the above least squares metric into a
robust metric (8.25) and re-solve your system using iteratively reweighted least squares.

6. Compute the size of the resulting composite canvas and resample each image into its
final position on the canvas. (Keeping track of bounding boxes will make this more
efficient.)

7. Average all of the images, or choose some kind of ordering and implement translucent
over compositing (3.8).

8. (Optional) Extend your parametric motion model to include rotations and scale, i.e.,
the similarity transform given in Table 8.1. Discuss how you could handle the case of
translations and rotations only (no scale).

9. (Optional) Write a simple tool to let the user adjust the ordering and opacity, and add
or remove images.

10. (Optional) Write down a different least squares problem that involves pairwise
matching of images. Discuss why this might be better or worse than the global matching
formula given in (8.12).


Ex 8.3: 2D rigid/Euclidean matching. Several alternative approaches are given in
Section 8.1.3 for estimating a 2D rigid (Euclidean) alignment.

1. Implement the various alternatives and compare their accuracy on synthetic data, i.e.,
random 2D point clouds with noisy feature positions.

2. One approach is to estimate the translations from the centroids and then estimate
rotation in polar coordinates. Do you need to weight the angles obtained from a polar
decomposition in some way to get the statistically correct estimate?

3. How can you modify your techniques to take into account either scalar (8.10) or full
two-dimensional point covariance weightings (8.11)? Do all of the previously developed
“shortcuts” still work or does full weighting require iterative optimization?


Ex 8.4: 2D match move/augmented reality. Replace a picture in a magazine or a book
with a different image or video.

1. Take a picture of a magazine or book page.

2. Outline a figure or picture on the page with a rectangle, i.e., draw over the four sides as
they appear in the image.

3. Match features in this area with each new image frame.

4. Replace the original image with an “advertising” insert, warping the new image with
the appropriate homography.

5. Try your approach on a clip from a sporting event (e.g., indoor or outdoor soccer) to
implement a billboard replacement.


Ex 8.5: Direct pixel-based alignment. Take a pair of images, compute a coarse-to-fine
affine alignment (Exercise 9.2) and then blend them using either averaging (Exercise 8.2) or
a Laplacian pyramid (Exercise 3.18). Extend your motion model from affine to perspective
(homography) to better deal with rotational mosaics and planar surfaces seen under arbitrary
motion.


Ex 8.6: Feature-based stitching. Extend your feature-based alignment technique from
Exercise 8.2 to use a full perspective model and then blend the resulting mosaic using
either averaging or more sophisticated distance-based feathering (Exercise 8.13).


Ex 8.7: Cylindrical strip panoramas. To generate cylindrical or spherical panoramas from
a horizontally panning (rotating) camera, it is best to use a tripod. Set your camera up to take
a series of 50% overlapped photos and then use the following steps to create your panorama:

1. Estimate the amount of radial distortion by taking some pictures with lots of long
straight lines near the edges of the image and then using the plumb-line method from
Exercise 11.5.

2. Compute the focal length either by using a ruler and paper (Debevec, Wenger et al.
2002) or by rotating your camera on the tripod, overlapping the images by exactly 0%
and counting the number of images it takes to make a 360° panorama.

3. Convert each of your images to cylindrical coordinates using (8.45-8.49).

4. Line up the images with a translational motion model using either a direct pixel-based
technique, such as coarse-to-fine incremental or an FFT, or a feature-based technique.

5. (Optional) If doing a complete 360° panorama, align the first and last images. Compute
the amount of accumulated vertical misregistration and re-distribute this among the
images.

6. Blend the resulting images using feathering or some other technique.
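For step 2, the image-counting method can be turned into a focal length with a single line of trigonometry. The sketch below assumes an ideal pinhole camera with no radial distortion and images that abut exactly (0% overlap), so that each image spans a horizontal field of view of 360°/n. The function name `focal_length_from_panorama` is hypothetical.

```python
import math

def focal_length_from_panorama(num_images, image_width):
    """Focal length in pixels, given that num_images abutting images cover 360 degrees.

    Assumes a pinhole camera: tan(fov / 2) = (image_width / 2) / f.
    Requires num_images >= 3 so that the field of view is below 180 degrees.
    """
    fov = 2.0 * math.pi / num_images          # horizontal field of view in radians
    return 0.5 * image_width / math.tan(0.5 * fov)
```

For example, if four abutting images of width 1000 pixels complete the circle, each image covers 90° and the focal length comes out at about 500 pixels.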


Ex 8.8: Coarse alignment. Use FFT or phase correlation (Section 9.1.2) to estimate the
initial alignment between successive images. How well does this work? Over what range of
overlaps? If it does not work, does aligning sub-sections (e.g., quarters) do better?


Ex 8.9: Automated mosaicing. Use feature-based alignment with four-point RANSAC for
homographies (Section 8.1.3, Equations (8.19-8.23)) or three-point RANSAC for rotational
motions (Brown, Hartley, and Nistér 2007) to match up all pairs of overlapping images.

Merge these pairwise estimates together by finding a spanning tree of pairwise relations.
Visualize the resulting global alignment, e.g., by displaying a blend of each image with all
other images that overlap it.

For greater robustness, try multiple spanning trees (perhaps randomly sampled based on
the confidence in pairwise alignments) to see if you can recover from bad pairwise matches
(Zach, Klopschitz, and Pollefeys 2010). As a measure of fitness, count how many pairwise
estimates are consistent with the global alignment.


Ex 8.10: Global optimization. Use the initialization from the previous algorithm to
perform a full bundle adjustment over all of the camera rotations and focal lengths, as described
in Section 11.4.2 and by Shum and Szeliski (2000). Optionally, estimate radial distortion
parameters as well or support fisheye lenses (Section 2.1.5).

As in the previous exercise, visualize the quality of your registration by creating
composites of each input image with its neighbors, optionally blinking between the original image
and the composite to better see misalignment artifacts.


Ex 8.11: Deghosting. Use the results of the previous bundle adjustment to predict the lo- 
cation of each feature in a consensus geometry. Use the difference between the predicted 
and actual feature locations to correct for small misregistrations, as described in Section 8.3.2 
(Shum and Szeliski 2000). 


Ex 8.12: Compositing surface. Choose a compositing surface (Section 8.4.1), e.g., a sin- 
gle reference image extended to a larger plane, a sphere represented using cylindrical or 
spherical coordinates, a stereographic “little planet” projection, or a cube map. 

Project all of your images onto this surface and blend them with equal weighting, for now 
(just to see where the original image seams are). 


Ex 8.13: Feathering and blending. Compute a feather (distance) map for each warped 
source image and use these maps to blend the warped images. 

Alternatively, use Laplacian pyramid blending (Exercise 3.18) or gradient domain blend- 
ing. 


Ex 8.14: Photomontage and object removal. Implement a “Photomontage” system in which 
users can indicate desired or unwanted regions in pre-registered images using strokes or other 
primitives (such as bounding boxes). 

(Optional) Devise an automatic moving objects remover (or “keeper”) by analyzing which
inconsistent regions are more or less typical given some consensus (e.g., median filtering) of
the aligned images. Figure 8.17 shows an example where the moving object was kept. Try
to make this work for sequences with large amounts of overlaps and consider averaging the
images to make the moving object look more ghosted.


Figure 9.1 Motion estimation: (a-b) regularization-based optical flow (Nagel and
Enkelmann 1986) © 1986 IEEE; (c-d) layered motion estimation (Wang and Adelson 1994) © 1994
IEEE; (e-f) sample image and ground truth flow from evaluation database (Butler, Wulff et
al. 2012) © 2012 Springer.


9 Motion estimation


Algorithms for aligning images and estimating motion in video sequences are among the most 
widely used in computer vision. For example, frame-rate image alignment is widely used in 
digital cameras to implement their image stabilization (IS) feature. 

An early example of a widely used image registration algorithm is the patch-based trans- 
lational alignment (optical flow) technique developed by Lucas and Kanade (1981). Variants 
of this algorithm are used in almost all motion-compensated video compression schemes 
such as MPEG/H.263 (Le Gall 1991) and HEVC/H.265 (Sullivan, Ohm et al. 2012). Similar 
parametric motion estimation algorithms have found a wide variety of applications, including 
video summarization (Teodosio and Bender 1993; Irani and Anandan 1998), video stabiliza- 
tion (Hansen, Anandan et al. 1994; Srinivasan, Chellappa et al. 2005; Matsushita, Ofek et al. 
2006), and video compression (Irani, Hsu, and Anandan 1995; Lee, Chen et al. 1997). More 
sophisticated image registration algorithms have also been developed for medical imaging 
and remote sensing. Image registration techniques are surveyed by Brown (1992), Zitová
and Flusser (2003), Goshtasby (2005), and Szeliski (2006a).

To estimate the motion between two or more images, a suitable error metric must first 
be chosen to compare the images (Section 9.1). Once this has been established, a suitable 
search technique must be devised. The simplest technique is to exhaustively try all possible 
alignments, i.e., to do a full search. In practice, this may be too slow, so hierarchical coarse- 
to-fine techniques (Section 9.1.1) based on image pyramids are normally used. Alternatively, 
Fourier transforms (Section 9.1.2) can be used to speed up the computation. 

To get sub-pixel precision in the alignment, incremental methods (Section 9.1.3) based 
on a Taylor series expansion of the image function are often used. These can also be applied 
to parametric motion models (Section 9.2), which model global image transformations such 
as rotation or shearing. Motion estimation can be made more reliable by learning the typical 
dynamics or motion statistics of the scenes or objects being tracked, e.g., the natural gait of 
walking people (Section 9.2). For more complex motions, piecewise parametric spline motion 
models (Section 9.2.2) can be used. 

In the presence of multiple independent (and perhaps non-rigid) motions, general-purpose 
optical flow (or optic flow) techniques need to be used, as described in Section 9.3. In re- 
cent years, the best-performing techniques have started using deep neural networks (Sec- 
tion 9.3.1). For even more complex motions that include a lot of occlusions, layered motion 
models (Section 9.4), which decompose the scene into coherently moving layers, can work 
well. Such representations can also be used to perform video object segmentation (Section 
9.4.3) and object tracking (Section 9.4.4). 

In this chapter, we describe each of these techniques in more detail. Additional details
can be found in review and comparative evaluation papers on motion estimation (Barron,
Fleet, and Beauchemin 1994; Mitiche and Bouthemy 1996; Stiller and Konrad 1999; Szeliski
2006a; Baker, Scharstein et al. 2011; Sun, Yang et al. 2018; Janai, Güney et al. 2020; Hur
and Roth 2020).


9.1 Translational alignment 


The simplest way to establish an alignment between two images or image patches is to shift
one image relative to the other. Given a template image $I_0(\mathbf{x})$ sampled at discrete pixel
locations $\{\mathbf{x}_i = (x_i, y_i)\}$, we wish to find where it is located in image $I_1(\mathbf{x})$. A least squares
solution to this problem is to find the minimum of the sum of squared differences (SSD)
function

$$E_{\mathrm{SSD}}(\mathbf{u}) = \sum_i [I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2 = \sum_i e_i^2, \tag{9.1}$$

where $\mathbf{u} = (u, v)$ is the displacement and $e_i = I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)$ is called the residual
error (or the displaced frame difference in the video coding literature).¹ (We ignore for the
moment the possibility that parts of $I_0$ may lie outside the boundaries of $I_1$ or be otherwise
not visible.) The assumption that corresponding pixel values remain the same in the two
images is often called the brightness constancy constraint.²

In general, the displacement $\mathbf{u}$ can be fractional, so a suitable interpolation function must
be applied to image $I_1(\mathbf{x})$. In practice, a bilinear interpolant is often used, but bicubic
interpolation can yield slightly better results (Szeliski and Scharstein 2004). Color images can be
processed by summing differences across all three color channels, although it is also possible
to first transform the images into a different color space or to only use the luminance (which
is often done in video encoders).
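As a concrete illustration of minimizing (9.1), the following sketch exhaustively evaluates the SSD over all non-negative integer displacements up to a limit. It is a minimal example under stated assumptions: grayscale images, integer shifts only (so no interpolation), and a template that lies entirely inside the image at every candidate position. The function name `ssd_full_search` is hypothetical.

```python
import numpy as np

def ssd_full_search(template, image, max_shift):
    """Return ((u, v), ssd) for the integer shift minimizing E_SSD in (9.1).

    u is the horizontal (column) shift and v the vertical (row) shift of the
    template I0 within the image I1; both range over 0..max_shift.
    """
    I0 = np.asarray(template, dtype=float)
    I1 = np.asarray(image, dtype=float)
    h, w = I0.shape
    best_shift, best_ssd = None, np.inf
    for v in range(max_shift + 1):
        for u in range(max_shift + 1):
            if v + h > I1.shape[0] or u + w > I1.shape[1]:
                continue                      # template would leave the image
            e = I1[v:v + h, u:u + w] - I0     # residuals e_i
            ssd = float(np.sum(e * e))
            if ssd < best_ssd:
                best_shift, best_ssd = (u, v), ssd
    return best_shift, best_ssd
```

The cost of this loop grows with the square of the search range times the template area, which is why coarse-to-fine and Fourier-based techniques are normally preferred for large search ranges.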


Robust error metrics. We can make the above error metric more robust to outliers by
replacing the squared error terms with a robust function $\rho(e_i)$ (Huber 1981; Hampel, Ronchetti
et al. 1986; Black and Anandan 1996; Stewart 1999) to obtain

$$E_{\mathrm{SRD}}(\mathbf{u}) = \sum_i \rho(I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)) = \sum_i \rho(e_i). \tag{9.2}$$

The robust norm $\rho(e)$ is a function that grows less quickly than the quadratic penalty
associated with least squares. One such function, sometimes used in motion estimation for video
coding because of its speed, is the sum of absolute differences (SAD) metric³ or $L_1$ norm,
i.e.,

$$E_{\mathrm{SAD}}(\mathbf{u}) = \sum_i |I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)| = \sum_i |e_i|. \tag{9.3}$$

However, because this function is not differentiable at the origin, it is not well suited to
gradient-descent approaches such as the ones presented in Section 9.1.3.

¹ The usual justification for using least squares is that it is the optimal estimate with respect to Gaussian noise.
See the discussion below on robust error metrics as well as Appendix B.3.
² Brightness constancy (Horn 1974) is the tendency for objects to maintain their perceived brightness under
varying illumination conditions.

Instead, a smoothly varying function that is quadratic for small values but grows more
slowly away from the origin is often used. Black and Rangarajan (1996) discuss a variety of
such functions, including the Geman-McClure function,

$$\rho_{\mathrm{GM}}(x) = \frac{x^2}{1 + x^2/a^2}, \tag{9.4}$$

where $a$ is a constant that can be thought of as an outlier threshold. An appropriate value
for the threshold can itself be derived using robust statistics (Huber 1981; Hampel, Ronchetti
et al. 1986; Rousseeuw and Leroy 1987), e.g., by computing the median absolute deviation,
$\mathrm{MAD} = \mathrm{med}_i |e_i|$, and multiplying it by 1.4 to obtain a robust estimate of the standard
deviation of the inlier noise process (Stewart 1999). Barron (2019) proposes a generalized robust
loss function that can model various outlier distributions and thresholds, as discussed in more
detail in Section 4.1.3 and Appendix B.3, and also has a Bayesian method for estimating the
loss function parameters.
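The Geman-McClure penalty and the MAD-based scale estimate are both one-liners, as the sketch below shows. The function names `geman_mcclure` and `robust_sigma` are hypothetical, and the factor 1.4 follows the text above (the more precise constant for Gaussian noise is about 1.4826).

```python
import numpy as np

def geman_mcclure(x, a):
    """Geman-McClure penalty (9.4): quadratic near zero, saturating at a**2."""
    x2 = np.square(np.asarray(x, dtype=float))
    return x2 / (1.0 + x2 / (a * a))

def robust_sigma(residuals):
    """Robust standard deviation estimate: 1.4 times the median absolute deviation."""
    return 1.4 * np.median(np.abs(residuals))
```

Because the penalty saturates at a² for large residuals, a single gross outlier contributes a bounded amount to the total error, whereas its squared error would dominate the SSD.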


Spatially varying weights. The error metrics above ignore the fact that for a given
alignment, some of the pixels being compared may lie outside the original image boundaries.
Furthermore, we may want to partially or completely downweight the contributions of
certain pixels. For example, we may want to selectively “erase” some parts of an image from
consideration when stitching a mosaic where unwanted foreground objects have been cut out.
For applications such as background stabilization, we may want to downweight the middle
part of the image, which often contains independently moving objects being tracked by the
camera.

All of these tasks can be accomplished by associating a spatially varying per-pixel weight
with each of the two images being matched. The error metric then becomes the weighted (or
windowed) SSD function,

$$E_{\mathrm{WSSD}}(\mathbf{u}) = \sum_i w_0(\mathbf{x}_i) w_1(\mathbf{x}_i + \mathbf{u}) [I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2, \tag{9.5}$$

where the weighting functions $w_0$ and $w_1$ are zero outside the image boundaries.

³ In video compression, e.g., the H.264 standard (https://www.itu.int/rec/T-REC-H.264), the sum of absolute
transformed differences (SATD), which measures the differences in a frequency transform space, e.g., using a
Hadamard transform, is often used, as it more accurately predicts quality (Richardson 2003).

If a large range of potential motions is allowed, the above metric can have a bias towards 
smaller overlap solutions. To counteract this bias, the windowed SSD score can be divided 


by the overlap area 


A= 5 wo(x;)w1 (Xi + u) (9.6) 


to compute a per-pixel (or mean) squared pixel error Ewssp/A. The square root of this 


quantity is the root mean square intensity error 


RMS = \/Ewssp/A (9.7) 


often reported in comparative studies. 
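The weighted SSD, overlap area, and RMS error of Equations (9.5) to (9.7) can be sketched for integer shifts as follows. The helper `shifted` and the convention that $\mathbf{u} = (u_x, u_y)$ indexes columns and rows are assumptions of this example, not part of the text.

```python
import numpy as np

def shifted(img, u):
    """Return img sampled at x + u for integer u = (ux, uy); zero outside bounds."""
    ux, uy = u
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    ys, xs = np.mgrid[0:h, 0:w]
    ys2, xs2 = ys + uy, xs + ux
    valid = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
    out[valid] = img[ys2[valid], xs2[valid]]
    return out

def wssd_rms(I0, I1, w0, w1, u):
    """Weighted SSD (9.5), overlap area (9.6), and RMS error (9.7)."""
    I1u = shifted(I1, u)
    w = w0 * shifted(w1, u)          # zero wherever x + u leaves the image
    e_wssd = np.sum(w * (I1u - I0) ** 2)
    area = np.sum(w)
    rms = np.sqrt(e_wssd / area) if area > 0 else np.inf
    return e_wssd, area, rms
```

Dividing by the overlap area before comparing candidate shifts removes the preference for small overlaps that the raw weighted sum would otherwise have.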


Bias and gain (exposure differences). Often, the two images being aligned were not taken 
with the same exposure. A simple model of linear (affine) intensity variation between the two
images is the bias and gain model,

$$ I_1(\mathbf{x} + \mathbf{u}) = (1 + \alpha) I_0(\mathbf{x}) + \beta, \tag{9.8} $$

where $\beta$ is the bias and $\alpha$ is the gain (Lucas and Kanade 1981; Gennert 1988; Fuh and
Maragos 1991; Baker, Gross, and Matthews 2003; Evangelidis and Psarakis 2008). The least
squares formulation then becomes

$$ E_{\mathrm{BG}}(\mathbf{u}) = \sum_i [I_1(\mathbf{x}_i + \mathbf{u}) - (1 + \alpha) I_0(\mathbf{x}_i) - \beta]^2 = \sum_i [\alpha I_0(\mathbf{x}_i) + \beta - e_i]^2. \tag{9.9} $$


Rather than taking a simple squared difference between corresponding patches, it becomes 
necessary to perform a linear regression (Appendix A.2), which is somewhat more costly. 
Note that for color images, it may be necessary to estimate a different bias and gain for each 
color channel to compensate for the automatic color correction performed by some digital 
cameras (Section 2.3.2). Bias and gain compensation are also used in video codecs, where 
they are known as weighted prediction (Richardson 2003). 
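For a fixed candidate shift, the linear regression mentioned above reduces to a two-parameter least squares fit of $e_i$ against $I_0(\mathbf{x}_i)$. A minimal sketch, assuming the shifted image $I_1(\mathbf{x} + \mathbf{u})$ has already been resampled into an array `I1u` (a name chosen for this example):

```python
import numpy as np

def bias_gain_fit(I0, I1u):
    """Fit (alpha, beta) minimizing sum_i [alpha * I0_i + beta - e_i]^2, as in (9.9)."""
    i0 = I0.ravel().astype(float)
    e = I1u.ravel().astype(float) - i0           # e_i = I1(x_i + u) - I0(x_i)
    M = np.stack([i0, np.ones_like(i0)], axis=1)  # regressors: intensity and constant
    (alpha, beta), *_ = np.linalg.lstsq(M, e, rcond=None)
    resid = M @ np.array([alpha, beta]) - e
    return alpha, beta, float(resid @ resid)
```

The returned residual energy is the exposure-compensated matching cost for that shift; for color images the same fit could be run once per channel.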

A more general (spatially varying, non-parametric) model of intensity variation, which is 
computed as part of the registration process, is used in Negahdaripour (1998), Jia and Tang 
(2003), and Seitz and Baker (2009). This can be useful for dealing with local variations 
such as the vignetting caused by wide-angle lenses, wide apertures, or lens housings. It is 
also possible to pre-process the images before comparing their values, e.g., using band-pass 
filtered images (Anandan 1989; Bergen, Anandan et al. 1992), or gradients (Scharstein 1994;
Papenberg, Bruhn et al. 2006), or using other local transformations such as histograms or rank
transforms (Cox, Roy, and Hingorani 1995; Zabih and Woodfill 1994), or to maximize mutual
information (Viola and Wells III 1997; Kim, Kolmogorov, and Zabih 2003). Hirschmüller
and Scharstein (2009) compare a number of these approaches and report on their relative
performance in scenes with exposure differences.


Correlation. An alternative to taking intensity differences is to perform correlation, i.e., to
maximize the product (or cross-correlation) of the two aligned images,

$$ E_{\mathrm{CC}}(\mathbf{u}) = \sum_i I_0(\mathbf{x}_i) I_1(\mathbf{x}_i + \mathbf{u}). \tag{9.10} $$


At first glance, this may appear to make bias and gain modeling unnecessary, since the images 
will prefer to line up regardless of their relative scales and offsets. However, this is actually 
not true. If a very bright patch exists in $I_1(\mathbf{x})$, the maximum product may actually lie in that
area. 


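For this reason, normalized cross-correlation is more commonly used,

$$ E_{\mathrm{NCC}}(\mathbf{u}) = \frac{\sum_i [I_0(\mathbf{x}_i) - \bar{I}_0] [I_1(\mathbf{x}_i + \mathbf{u}) - \bar{I}_1]}{\sqrt{\sum_i [I_0(\mathbf{x}_i) - \bar{I}_0]^2} \sqrt{\sum_i [I_1(\mathbf{x}_i + \mathbf{u}) - \bar{I}_1]^2}}, \tag{9.11} $$

where

$$ \bar{I}_0 = \frac{1}{N} \sum_i I_0(\mathbf{x}_i) \quad \text{and} \tag{9.12} $$

$$ \bar{I}_1 = \frac{1}{N} \sum_i I_1(\mathbf{x}_i + \mathbf{u}) \tag{9.13} $$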


are the mean images of the corresponding patches and N is the number of pixels in the patch. 
The normalized cross-correlation score is always guaranteed to be in the range [—1, 1], which 
makes it easier to handle in some higher-level applications, such as deciding which patches 
truly match. Normalized correlation works well when matching images taken with different 
exposures, e.g., when creating high dynamic range images (Section 10.2). Note, however, 
that the NCC score is undefined if either of the two patches has zero variance (and, in fact, its 
performance degrades for noisy low-contrast regions). 
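A direct transcription of the NCC score (9.11) for two equally sized patches might look like the sketch below. Returning `None` for zero-variance patches is a choice made for this example to reflect the undefined case noted above; the `eps` threshold is likewise an assumption.

```python
import numpy as np

def ncc(p0, p1, eps=1e-12):
    """Normalized cross-correlation of two patches; None if either has zero variance."""
    a = p0.astype(float) - p0.mean()
    b = p1.astype(float) - p1.mean()
    denom = np.sqrt(np.sum(a * a)) * np.sqrt(np.sum(b * b))
    if denom < eps:
        return None                     # score undefined for a constant patch
    return float(np.sum(a * b) / denom)
```

Because the means are subtracted and the result is divided by both patch norms, any positive gain and any bias applied to one patch leave the score unchanged.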

A variant on NCC, which is related to the bias–gain regression implicit in the matching
score (9.9), is the normalized SSD score

$$ E_{\mathrm{NSSD}}(\mathbf{u}) = \frac{1}{2} \frac{\sum_i \left[ [I_0(\mathbf{x}_i) - \bar{I}_0] - [I_1(\mathbf{x}_i + \mathbf{u}) - \bar{I}_1] \right]^2}{\sqrt{\sum_i [I_0(\mathbf{x}_i) - \bar{I}_0]^2 + [I_1(\mathbf{x}_i + \mathbf{u}) - \bar{I}_1]^2}} \tag{9.14} $$

proposed by Criminisi, Shotton et al. (2007). In their experiments, they find that it produces
comparable results to NCC, but is more efficient when applied to a large number of overlapping
patches using a moving average technique (Section 3.2.2).


9.1.1 Hierarchical motion estimation 


Now that we have a well-defined alignment cost function to optimize, how can we find its
minimum? The simplest solution is to do a full search over some range of shifts, using either
integer or sub-pixel steps. This is often the approach used for block matching in motion
compensated video compression, where a range of possible motions (say, ±16 pixels) is
explored.⁴

To accelerate this search process, hierarchical motion estimation is often used: an image 
pyramid (Section 3.5) is constructed and a search over a smaller number of discrete pixels 
(corresponding to the same range of motion) is first performed at coarser levels (Quam 1984; 
Anandan 1989; Bergen, Anandan et al. 1992). The motion estimate from one level of the 
pyramid is then used to initialize a smaller local search at the next finer level. Alternatively, 
several seeds (good solutions) from the coarse level can be used to initialize the fine-level 
search. While this is not guaranteed to produce the same result as a full search, it usually 
works almost as well and is much faster. 


More formally, let

$$ I_k^{(l)}(\mathbf{x}_j) \leftarrow \tilde{I}_k^{(l-1)}(2\mathbf{x}_j) \tag{9.15} $$

be the decimated image at level $l$ obtained by subsampling (downsampling) a smoothed version
of the image at level $l-1$. See Section 3.5 for how to perform the required downsampling
(pyramid construction) without introducing too much aliasing.

At the coarsest level, we search for the best displacement $\mathbf{u}^{(l)}$ that minimizes the
difference between images $I_0^{(l)}$ and $I_1^{(l)}$. This is usually done using a full search over some
range of displacements $\mathbf{u}^{(l)} \in 2^{-l}[-S, S]^2$, where $S$ is the desired search range at the finest
(original) resolution level, optionally followed by the incremental refinement step described
in Section 9.1.3.

Once a suitable motion vector has been estimated, it is used to predict a likely displacement

$$ \hat{\mathbf{u}}^{(l-1)} \leftarrow 2\mathbf{u}^{(l)} \tag{9.16} $$


for the next finer level.⁵ The search over displacements is then repeated at the finer level over
a much narrower range of displacements, say $\hat{\mathbf{u}}^{(l-1)} \pm 1$, again optionally combined with an
incremental refinement step (Anandan 1989). Alternatively, one of the images can be warped
(resampled) by the current motion estimate, in which case only small incremental motions
need to be computed at the finer level. A nice description of the whole process, extended to
parametric motion estimation (Section 9.2), is provided by Bergen, Anandan et al. (1992).

⁴ In stereo matching (Section 12.1.2), an explicit search over all possible disparities (i.e., a plane sweep) is almost always performed, as the number of search hypotheses is much smaller due to the 1D nature of the potential displacements.
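The coarse-to-fine procedure can be sketched as follows for integer translations. This is only an illustration under simplifying assumptions: the pyramid uses a crude 2 × 2 box filter rather than the filters of Section 3.5, the cost is the mean squared difference over the overlap, and all function names are invented for the example.

```python
import numpy as np

def downsample(img):
    """Halve the resolution with a 2x2 box filter (a crude stand-in for Section 3.5)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def mean_ssd(I0, I1, u):
    """Mean squared difference between I1(x + u) and I0(x) over the overlap region."""
    ux, uy = u
    h, w = I0.shape
    x0, x1 = max(0, -ux), min(w, w - ux)
    y0, y1 = max(0, -uy), min(h, h - uy)
    if x1 <= x0 or y1 <= y0:
        return np.inf
    d = I1[y0 + uy:y1 + uy, x0 + ux:x1 + ux] - I0[y0:y1, x0:x1]
    return float(np.mean(d * d))

def search(I0, I1, center, radius):
    """Exhaustive search over integer shifts within `radius` of `center`."""
    cx, cy = center
    cands = [(cx + dx, cy + dy) for dy in range(-radius, radius + 1)
             for dx in range(-radius, radius + 1)]
    return min(cands, key=lambda u: mean_ssd(I0, I1, u))

def coarse_to_fine(I0, I1, levels=3, S=8):
    """Full search at the coarsest level, then +/-1 refinement at each finer level."""
    pyr = [(I0.astype(float), I1.astype(float))]
    for _ in range(levels - 1):
        pyr.append((downsample(pyr[-1][0]), downsample(pyr[-1][1])))
    u = search(*pyr[-1], (0, 0), max(1, S >> (levels - 1)))
    for J0, J1 in reversed(pyr[:-1]):
        u = search(J0, J1, (2 * u[0], 2 * u[1]), 1)   # predict by doubling, as in (9.16)
    return u
```

With three levels and $S = 8$, the coarsest level examines 25 candidates and each finer level only 9, compared with 289 for a full search at the original resolution.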


9.1.2 Fourier-based alignment 


When the search range corresponds to a significant fraction of the larger image (as is the case 
in image stitching, see Section 8.2), the hierarchical approach may not work that well, as it 
is often not possible to coarsen the representation too much before significant features are 
blurred away. In this case, a Fourier-based approach may be preferable. 

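Fourier-based alignment relies on the fact that the Fourier transform of a shifted signal
has the same magnitude as the original signal, but a linearly varying phase (Section 3.4), i.e.,

$$ \mathcal{F}\{I_1(\mathbf{x} + \mathbf{u})\} = \mathcal{F}\{I_1(\mathbf{x})\} e^{-j\mathbf{u}\cdot\boldsymbol{\omega}} = \mathcal{I}_1(\boldsymbol{\omega}) e^{-j\mathbf{u}\cdot\boldsymbol{\omega}}, \tag{9.17} $$

where $\boldsymbol{\omega}$ is the vector-valued angular frequency of the Fourier transform and we use
calligraphic notation $\mathcal{I}_1(\boldsymbol{\omega}) = \mathcal{F}\{I_1(\mathbf{x})\}$ to denote the Fourier transform of a signal
(Section 3.4).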

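Another useful property of Fourier transforms is that convolution in the spatial domain
corresponds to multiplication in the Fourier domain (Section 3.4).⁶ The Fourier transform of
the cross-correlation function $E_{\mathrm{CC}}$ can thus be written as

$$ \mathcal{F}\{E_{\mathrm{CC}}(\mathbf{u})\} = \mathcal{F}\left\{\sum_i I_0(\mathbf{x}_i) I_1(\mathbf{x}_i + \mathbf{u})\right\} = \mathcal{F}\{I_0(\mathbf{u}) \,\bar{*}\, I_1(\mathbf{u})\} = \mathcal{I}_0(\boldsymbol{\omega}) \mathcal{I}_1^*(\boldsymbol{\omega}), \tag{9.18} $$

where

$$ f(\mathbf{u}) \,\bar{*}\, g(\mathbf{u}) = \sum_i f(\mathbf{x}_i) g(\mathbf{x}_i + \mathbf{u}) \tag{9.19} $$

is the correlation function, i.e., the convolution of one signal with the reverse of the other,
and $\mathcal{I}_1^*(\boldsymbol{\omega})$ is the complex conjugate of $\mathcal{I}_1(\boldsymbol{\omega})$. This is because convolution is defined as the
summation of one signal with the reverse of the other (Section 3.4).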


⁵ This doubling of displacements is only necessary if displacements are defined in integer pixel coordinates, which is the usual case in the literature (Bergen, Anandan et al. 1992). If normalized device coordinates (Section 2.1.4) are used instead, the displacements (and search ranges) need not change from level to level, although the step sizes will need to be adjusted, to keep search steps of roughly one pixel or less.

⁶ In fact, the Fourier shift property (9.17) derives from the convolution theorem by observing that shifting is equivalent to convolution with a displaced delta function $\delta(\mathbf{x} - \mathbf{u})$.

To efficiently evaluate $E_{\mathrm{CC}}$ over the range of all possible values of $\mathbf{u}$, we take the Fourier
transforms of both images $I_0(\mathbf{x})$ and $I_1(\mathbf{x})$, multiply both transforms together (after conjugating
the second one), and take the inverse transform of the result. The Fast Fourier Transform
algorithm can compute the transform of an $N \times M$ image in $O(NM \log NM)$ operations
(Bracewell 1986). This can be significantly faster than the $O(N^2 M^2)$ operations required to
do a full search when the full range of image overlaps is considered.
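The recipe above can be sketched with NumPy's FFT routines. Which of the two transforms gets conjugated depends on the sign convention of the transform; with NumPy's forward transform, conjugating the transform of $I_0$ yields exactly the circular version of $E_{\mathrm{CC}}(\mathbf{u})$ as defined in (9.10). The helper `best_shift`, which maps wrapped array indices to signed shifts, is an addition for this example.

```python
import numpy as np

def cross_correlation_fft(I0, I1):
    """Circular E_CC(u) = sum_x I0(x) I1(x + u) for all u; result indexed [uy, ux]."""
    F0 = np.fft.fft2(I0)
    F1 = np.fft.fft2(I1)
    return np.fft.ifft2(np.conj(F0) * F1).real

def best_shift(score, maximize=True):
    """Location of the extremum, converted from wrapped indices to signed (ux, uy)."""
    idx = np.argmax(score) if maximize else np.argmin(score)
    uy, ux = np.unravel_index(idx, score.shape)
    h, w = score.shape
    if uy > h // 2:
        uy -= h
    if ux > w // 2:
        ux -= w
    return int(ux), int(uy)
```

The whole correlation surface comes out of three FFTs, regardless of how many shifts are being evaluated.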

While Fourier-based convolution is often used to accelerate the computation of image
correlations, it can also be used to accelerate the sum of squared differences function (and its
variants). Consider the SSD formula given in (9.1). Its Fourier transform can be written as

$$ \mathcal{F}\{E_{\mathrm{SSD}}(\mathbf{u})\} = \mathcal{F}\left\{\sum_i [I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2\right\} = \delta(\boldsymbol{\omega}) \sum_i [I_0^2(\mathbf{x}_i) + I_1^2(\mathbf{x}_i)] - 2\mathcal{I}_0(\boldsymbol{\omega}) \mathcal{I}_1^*(\boldsymbol{\omega}). \tag{9.20} $$

Thus, the SSD function can be computed by taking twice the correlation function and
subtracting it from the sum of the energies in the two images (or patches).
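In code, the relationship in (9.20) amounts to one correlation plus a constant. The sketch below assumes circular shifts, so it reproduces the brute-force SSD only when pixels wrap around at the image boundaries; the function name is chosen for this example.

```python
import numpy as np

def ssd_fft(I0, I1):
    """Circular E_SSD(u) for all integer shifts u; result indexed [uy, ux]."""
    energy = np.sum(I0**2) + np.sum(I1**2)       # shift-independent under wraparound
    cc = np.fft.ifft2(np.conj(np.fft.fft2(I0)) * np.fft.fft2(I1)).real
    return energy - 2.0 * cc
```

The energy term is constant only because a circular shift preserves the sum of squared intensities, which is one reason the windowed formulations are needed for partially overlapping images.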


Windowed correlation. Unfortunately, the Fourier convolution theorem only applies when
the summation over $\mathbf{x}_i$ is performed over all the pixels in both images, using a circular shift
of the image when accessing pixels outside the original boundaries. While this is acceptable
for small shifts and comparably sized images, it makes no sense when the images overlap by
a small amount or one image is a small subset of the other.

In that case, the cross-correlation function should be replaced with a windowed (weighted)
cross-correlation function,

$$ E_{\mathrm{WCC}}(\mathbf{u}) = \sum_i w_0(\mathbf{x}_i) I_0(\mathbf{x}_i) \, w_1(\mathbf{x}_i + \mathbf{u}) I_1(\mathbf{x}_i + \mathbf{u}) \tag{9.21} $$

$$ = [w_0(\mathbf{x}) I_0(\mathbf{x})] \,\bar{*}\, [w_1(\mathbf{x}) I_1(\mathbf{x})], \tag{9.22} $$

where the weighting functions $w_0$ and $w_1$ are zero outside the valid ranges of the images
and both images are padded so that circular shifts return 0 values outside the original image
boundaries.

An even more interesting case is the computation of the weighted SSD function introduced
in Equation (9.5),

$$ E_{\mathrm{WSSD}}(\mathbf{u}) = \sum_i w_0(\mathbf{x}_i) w_1(\mathbf{x}_i + \mathbf{u}) [I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2. \tag{9.23} $$

Expanding this as a sum of correlations and deriving the appropriate set of Fourier transforms
is left for Exercise 9.1.

The same kind of derivation can also be applied to the bias–gain corrected sum of squared
difference function $E_{\mathrm{BG}}$ (9.9). Again, Fourier transforms can be used to efficiently compute
all the correlations needed to perform the linear regression in the bias and gain parameters in 
order to estimate the exposure-compensated difference for each potential shift (Exercise 9.1). 
It is also possible to use Fourier transforms to estimate the rotation and scale between two 
patches that are centered on the same pixel, as described in De Castro and Morandi (1987) 
and Szeliski (2010, Section 8.1.2). 


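Phase correlation. A variant of regular correlation (9.18) that is sometimes used for motion
estimation is phase correlation (Kuglin and Hines 1975; Brown 1992). Here, the spectrum of
the two signals being matched is whitened by dividing each per-frequency product in (9.18)
by the magnitudes of the Fourier transforms,

$$ \mathcal{F}\{E_{\mathrm{PC}}(\mathbf{u})\} = \frac{\mathcal{I}_0(\boldsymbol{\omega}) \mathcal{I}_1^*(\boldsymbol{\omega})}{\|\mathcal{I}_0(\boldsymbol{\omega})\| \, \|\mathcal{I}_1(\boldsymbol{\omega})\|} \tag{9.24} $$

before taking the final inverse Fourier transform. In the case of noiseless signals with perfect
(cyclic) shift, we have $I_1(\mathbf{x} + \mathbf{u}) = I_0(\mathbf{x})$ and hence, from Equation (9.17), we obtain

$$ \mathcal{F}\{I_1(\mathbf{x} + \mathbf{u})\} = \mathcal{I}_1(\boldsymbol{\omega}) e^{-j\mathbf{u}\cdot\boldsymbol{\omega}} = \mathcal{I}_0(\boldsymbol{\omega}) \quad \text{and} \quad \mathcal{F}\{E_{\mathrm{PC}}(\mathbf{u})\} = e^{-j\mathbf{u}\cdot\boldsymbol{\omega}}. \tag{9.25} $$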


The output of phase correlation (under ideal conditions) is therefore a single spike (impulse) 
located at the correct value of u, which (in principle) makes it easier to find the correct 
estimate. 
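A minimal sketch of phase correlation, using the same NumPy sign convention as an ordinary FFT-based correlation (the transform of $I_0$ is conjugated so that the peak lands at the $\mathbf{u}$ satisfying $I_1(\mathbf{x} + \mathbf{u}) = I_0(\mathbf{x})$). The small constant `eps`, which guards against division by zero at empty frequencies, is an assumption of this example.

```python
import numpy as np

def phase_correlation(I0, I1, eps=1e-12):
    """Whitened circular correlation; ideally a single impulse at the true shift."""
    F0 = np.fft.fft2(I0)
    F1 = np.fft.fft2(I1)
    cross = np.conj(F0) * F1
    cross /= np.maximum(np.abs(F0) * np.abs(F1), eps)   # keep only the phase
    return np.fft.ifft2(cross).real                      # indexed [uy, ux], wrapped
```

For a noiseless cyclic shift every frequency contributes a unit-magnitude phasor, so the inverse transform is an impulse; with real images the peak is broadened and lowered by noise and boundary effects.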

Phase correlation has a reputation in some quarters of outperforming regular correlation, 
but this behavior depends on the characteristics of the signals and noise. If the original images 
are contaminated by noise in a narrow frequency band (e.g., low-frequency noise or peaked 
frequency “hum”), the whitening process effectively de-emphasizes the noise in these regions. 
However, if the original signals have very low signal-to-noise ratio at some frequencies (say, 
two blurry or low-textured images with lots of high-frequency noise), the whitening process 
can actually decrease performance (see Exercise 9.1). 

Gradient cross-correlation has emerged as a promising alternative to phase correlation
(Argyriou and Vlachos 2003), although further systematic studies are probably warranted.
Phase correlation has also been studied by Fleet and Jepson (1990) as a method for estimating
general optical flow and stereo disparity.

Figure 9.2 Taylor series approximation of a function and the incremental computation of
the optical flow correction amount. $\mathbf{J}_1(\mathbf{x}_i + \mathbf{u})$ is the image gradient at $(\mathbf{x}_i + \mathbf{u})$ and $e_i$ is
the current intensity difference.


9.1.3 Incremental refinement 


The techniques described up till now can estimate alignment to the nearest pixel (or poten- 
tially fractional pixel if smaller search steps are used). In general, image stabilization and 
stitching applications require much higher accuracies to obtain acceptable results. 

To obtain better sub-pixel estimates, we can use one of several techniques described by 
Tian and Huhns (1986). One possibility is to evaluate several discrete (integer or fractional) 
values of (u, v) around the best value found so far and to interpolate the matching score to 
find an analytic minimum (Szeliski and Scharstein 2004). 
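One simple instance of interpolating the matching score is to fit a parabola through the scores at the best integer shift and its two neighbors along one axis, then take the vertex. The sketch below does this in 1D; treating the two axes separately is a simplifying assumption of this example, not something prescribed by the cited papers.

```python
def parabola_offset(s_minus, s_zero, s_plus):
    """Sub-pixel offset of the extremum of the parabola through scores at -1, 0, +1."""
    denom = s_minus - 2.0 * s_zero + s_plus
    if denom == 0:
        return 0.0                     # flat or linear scores: no refinement possible
    return 0.5 * (s_minus - s_plus) / denom
```

If the score is sampled from an exact quadratic, the vertex is recovered exactly; for real matching costs the estimate is only approximate.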

A more commonly used approach, first proposed by Lucas and Kanade (1981), is to 
perform gradient descent on the SSD energy function (9.1), using a Taylor series expansion 


of the image function (Figure 9.2), 


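$$ E_{\mathrm{LK-SSD}}(\mathbf{u} + \Delta\mathbf{u}) = \sum_i [I_1(\mathbf{x}_i + \mathbf{u} + \Delta\mathbf{u}) - I_0(\mathbf{x}_i)]^2 \tag{9.26} $$

$$ \approx \sum_i [I_1(\mathbf{x}_i + \mathbf{u}) + \mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) \Delta\mathbf{u} - I_0(\mathbf{x}_i)]^2 \tag{9.27} $$

$$ = \sum_i [\mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) \Delta\mathbf{u} + e_i]^2, \tag{9.28} $$

where

$$ \mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) = \nabla I_1(\mathbf{x}_i + \mathbf{u}) = \left( \frac{\partial I_1}{\partial x}, \frac{\partial I_1}{\partial y} \right)(\mathbf{x}_i + \mathbf{u}) \tag{9.29} $$

is the image gradient or Jacobian at $(\mathbf{x}_i + \mathbf{u})$ and

$$ e_i = I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i), \tag{9.30} $$

first introduced in (9.1), is the current intensity error.⁷ The gradient at a particular sub-pixel
location $(\mathbf{x}_i + \mathbf{u})$ can be computed using a variety of techniques, the simplest of which is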


simply to take the horizontal and vertical differences between pixels $\mathbf{x}$ and $\mathbf{x} + (1, 0)$ or
$\mathbf{x} + (0, 1)$. More sophisticated derivatives can sometimes lead to noticeable performance
improvements.

⁷ We follow the convention, commonly used in robotics and by Baker and Matthews (2004), that derivatives with respect to (column) vectors result in row vectors, so that fewer transposes are needed in the formulas.

The linearized form of the incremental update to the SSD error (9.28) is often called the
optical flow constraint or brightness constancy constraint equation (Horn and Schunck 1981)

$$ I_x u + I_y v + I_t = 0, \tag{9.31} $$

where the subscripts in $I_x$ and $I_y$ denote spatial derivatives, and $I_t$ is called the temporal
derivative, which makes sense if we are computing instantaneous velocity in a video sequence.
When squared and summed or integrated over a region, it can be used to compute
optical flow (Horn and Schunck 1981).

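The above least squares problem (9.28) can be minimized by solving the associated normal
equations (Appendix A.2),

$$ \mathbf{A} \Delta\mathbf{u} = \mathbf{b} \tag{9.32} $$

where

$$ \mathbf{A} = \sum_i \mathbf{J}_1^T(\mathbf{x}_i + \mathbf{u}) \mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) \tag{9.33} $$

and

$$ \mathbf{b} = -\sum_i e_i \mathbf{J}_1^T(\mathbf{x}_i + \mathbf{u}) \tag{9.34} $$

are called the (Gauss–Newton approximation of the) Hessian and gradient-weighted residual
vector, respectively.⁸ These matrices are also often written as

$$ \mathbf{A} = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \quad \text{and} \quad \mathbf{b} = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}. \tag{9.35} $$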

The gradients required for $\mathbf{J}_1(\mathbf{x}_i + \mathbf{u})$ can be evaluated at the same time as the image
warps required to estimate $I_1(\mathbf{x}_i + \mathbf{u})$ (Section 3.6.1 (3.75)) and, in fact, are often computed
as a side-product of image interpolation. If efficiency is a concern, these gradients can be
replaced by the gradients in the template image,

$$ \mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) \approx \mathbf{J}_0(\mathbf{x}_i), \tag{9.36} $$

because near the correct alignment, the template and displaced target images should look
similar. This has the advantage of allowing the precomputation of the Hessian and Jacobian

images, which can result in significant computational savings (Hager and Belhumeur 1998;
Baker and Matthews 2004). A further reduction in computation can be obtained by writing
the warped image $I_1(\mathbf{x}_i + \mathbf{u})$ used to compute $e_i$ in (9.30) as a convolution of a sub-pixel
interpolation filter with the discrete samples in $I_1$ (Peleg and Rav-Acha 2006). Precomputing
the inner product between the gradient field and shifted version of $I_1$ allows the iterative
re-computation of $e_i$ to be performed in constant time (independent of the number of pixels).

⁸ The true Hessian is the full second derivative of the error function $E$, which may not be positive definite; see Section 8.1.3 and Appendix A.3.

The effectiveness of the above incremental update rule relies on the quality of the Taylor 
series approximation. When far away from the true displacement (say, 1–2 pixels), several
iterations may be needed. It is possible, however, to estimate a value for $\mathbf{J}_1$ using a least
squares fit to a series of larger displacements to increase the range of convergence (Jurie and 
Dhome 2002) or to “learn” a special-purpose recognizer for a given patch (Avidan 2001; 
Williams, Blake, and Cipolla 2003; Lepetit, Pilet, and Fua 2006; Hinterstoisser, Benhimane 
et al. 2008; Ozuysal, Calonder et al. 2010) as discussed in Section 7.1.5. 

A commonly used stopping criterion for incremental updating is to monitor the magnitude
of the displacement correction $\|\Delta\mathbf{u}\|$ and to stop when it drops below a certain threshold (say,
1/10 of a pixel). For larger motions, it is usual to combine the incremental update rule with a
hierarchical coarse-to-fine search strategy, as described in Section 9.1.1.
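The incremental update of Equations (9.28) to (9.34), together with the stopping criterion just described, can be sketched for pure translation as follows. This is an illustrative implementation under several assumptions: bilinear interpolation for the warp, central differences (`np.gradient`) for the image gradient, a fixed border margin to avoid sampling outside the image, and function names invented for the example.

```python
import numpy as np

def bilinear(img, xs, ys):
    """Sample img at float coordinates (xs, ys), clamping to the image border."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1.001)
    ys = np.clip(ys, 0, h - 1.001)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    fx, fy = xs - x0, ys - y0
    return ((1 - fy) * ((1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1])
            + fy * ((1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]))

def lucas_kanade_translation(I0, I1, u0=(0.0, 0.0), max_iters=50, tol=0.1, margin=4):
    """Iterate A du = b until the correction ||du|| drops below tol (in pixels)."""
    h, w = I0.shape
    ys, xs = np.mgrid[margin:h - margin, margin:w - margin].astype(float)
    T = I0[margin:h - margin, margin:w - margin].astype(float)
    gy, gx = np.gradient(I1.astype(float))            # derivatives along rows, columns
    u = np.array(u0, dtype=float)                     # u = (ux, uy)
    for _ in range(max_iters):
        e = bilinear(I1, xs + u[0], ys + u[1]) - T    # e_i, Equation (9.30)
        Ix = bilinear(gx, xs + u[0], ys + u[1])       # J_1(x_i + u), Equation (9.29)
        Iy = bilinear(gy, xs + u[0], ys + u[1])
        A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                      [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
        b = -np.array([np.sum(Ix * e), np.sum(Iy * e)])
        du = np.linalg.solve(A, b)
        u += du
        if np.linalg.norm(du) < tol:
            break
    return u
```

The solve step will fail or behave erratically when `A` is close to singular, which is the conditioning issue discussed next; in practice the routine would also be seeded from a coarser pyramid level rather than from zero.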


Conditioning and aperture problems. Sometimes, the inversion of the linear system (9.32) 
can be poorly conditioned because of lack of two-dimensional texture in the patch being 
aligned. A commonly occurring example of this is the aperture problem, first identified in 
some of the early papers on optical flow (Horn and Schunck 1981) and then studied more ex- 
tensively by Anandan (1989). Consider an image patch that consists of a slanted edge moving 
to the right (Figure 7.4). Only the normal component of the velocity (displacement) can be 
reliably recovered in this case. This manifests itself in (9.32) as a rank-deficient matrix A, 
i.e., one whose smaller eigenvalue is very close to zero.⁹

When Equation (9.32) is solved, the component of the displacement along the edge is very 
poorly conditioned and can result in wild guesses under small noise perturbations. One way 
to mitigate this problem is to add a prior (soft constraint) on the expected range of motions 
(Simoncelli, Adelson, and Heeger 1991; Baker, Gross, and Matthews 2004; Govindu 2006). 
This can be accomplished by adding a small value to the diagonal of A, which essentially 
biases the solution towards smaller $\Delta\mathbf{u}$ values that still (mostly) minimize the squared error.

However, the pure Gaussian model assumed when using a simple (fixed) quadratic prior, 


as in Simoncelli, Adelson, and Heeger (1991), does not always hold in practice, e.g., because 


of aliasing along strong edges (Triggs 2004). For this reason, it may be prudent to add some
small fraction (say, 5%) of the larger eigenvalue to the smaller one before doing the matrix
inversion.

⁹ The matrix $\mathbf{A}$ is by construction always guaranteed to be symmetric positive semi-definite, i.e., it has real non-negative eigenvalues.
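The eigenvalue adjustment suggested above can be sketched with a symmetric eigendecomposition of the 2 × 2 matrix $\mathbf{A}$. The function name and the default fraction are choices made for this example.

```python
import numpy as np

def regularized_solve(A, b, fraction=0.05):
    """Solve A du = b after adding a fraction of the larger eigenvalue to the smaller."""
    evals, evecs = np.linalg.eigh(A)      # eigenvalues returned in ascending order
    evals = evals.copy()
    evals[0] += fraction * evals[-1]      # lift the weakly constrained direction
    return evecs @ ((evecs.T @ b) / evals)

# Pure horizontal gradient (aperture problem): A is rank deficient.
A = np.array([[4.0, 0.0], [0.0, 0.0]])
b = np.array([2.0, 0.0])
du = regularized_solve(A, b)              # the along-edge component stays at zero
```

Unlike adding a constant to the diagonal, this scales the regularization with the strength of the dominant gradient direction, so strongly textured patches are barely affected.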


Uncertainty modeling. The reliability of a particular patch-based motion estimate can be 
captured more formally with an uncertainty model. The simplest such model is a covariance 
matrix, which captures the expected variance in the motion estimate in all possible directions. 
As discussed in Section 8.1.4 and Appendix B.6, under small amounts of additive Gaussian
noise, it can be shown that the covariance matrix $\boldsymbol{\Sigma}_{\mathbf{u}}$ is proportional to the inverse of the
Hessian $\mathbf{A}$,

$$ \boldsymbol{\Sigma}_{\mathbf{u}} = \sigma_n^2 \mathbf{A}^{-1}, \tag{9.37} $$

where $\sigma_n^2$ is the variance of the additive Gaussian noise (Anandan 1989; Matthies, Kanade,
and Szeliski 1989; Szeliski 1989).

For larger amounts of noise, the linearization performed by the Lucas–Kanade algorithm
in (9.28) is only approximate, so the above quantity becomes a Cramer–Rao lower bound on
the true covariance. Thus, the minimum and maximum eigenvalues of the Hessian $\mathbf{A}$ can now
be interpreted as the (scaled) inverse variances in the least-certain and most-certain directions
of motion. (A more detailed analysis using a more realistic model of image noise is given by
Steele and Jaynes (2005).) Figure 7.5 shows the local SSD surfaces for three different pixel
locations in an image. As you can see, the surface has a clear minimum in the highly textured
region and suffers from the aperture problem near the strong edge.
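Under the small-noise model of (9.37), the covariance and its principal directions follow directly from the Hessian. A brief sketch, with a function name and return layout chosen for this example:

```python
import numpy as np

def motion_uncertainty(A, sigma_n):
    """Covariance (9.37), least-certain direction, and per-direction std deviations."""
    cov = sigma_n**2 * np.linalg.inv(A)
    evals, evecs = np.linalg.eigh(A)       # ascending eigenvalues of the Hessian
    least_certain_dir = evecs[:, 0]        # smallest eigenvalue means largest variance
    std = sigma_n / np.sqrt(evals)         # std along each eigenvector, same order
    return cov, least_certain_dir, std
```

For a patch on a strong vertical edge, the smallest eigenvalue corresponds to motion along the edge, so `least_certain_dir` points in the direction affected by the aperture problem.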


Bias and gain, weighting, and robust error metrics. The Lucas–Kanade update rule can
also be applied to the bias–gain equation (9.9) to obtain

$$ E_{\mathrm{LK-BG}}(\mathbf{u} + \Delta\mathbf{u}) = \sum_i [\mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) \Delta\mathbf{u} + e_i - \alpha I_0(\mathbf{x}_i) - \beta]^2 \tag{9.38} $$

(Lucas and Kanade 1981; Gennert 1988; Fuh and Maragos 1991; Baker, Gross, and Matthews
2003). The resulting 4 × 4 system of equations can be solved to simultaneously estimate the
translational displacement update $\Delta\mathbf{u}$ and the bias and gain parameters $\beta$ and $\alpha$.

A similar formulation can be derived for images (templates) that have a linear appearance
variation,

$$ I_1(\mathbf{x} + \mathbf{u}) \approx I_0(\mathbf{x}) + \sum_j \lambda_j B_j(\mathbf{x}), \tag{9.39} $$

where the $B_j(\mathbf{x})$ are the basis images and the $\lambda_j$ are the unknown coefficients (Hager and
Belhumeur 1998; Baker, Gross et al. 2003; Baker, Gross, and Matthews 2003). Potential
linear appearance variations include illumination changes (Hager and Belhumeur 1998) and
small non-rigid deformations (Black and Jepson 1998; Kambhamettu, Goldgof et al. 2003).


A weighted (windowed) version of the Lucas–Kanade algorithm is also possible:

$$ E_{\mathrm{LK-WSSD}}(\mathbf{u} + \Delta\mathbf{u}) = \sum_i w_0(\mathbf{x}_i) w_1(\mathbf{x}_i + \mathbf{u}) [\mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) \Delta\mathbf{u} + e_i]^2. \tag{9.40} $$

Note that here, in deriving the Lucas–Kanade update from the original weighted SSD function
(9.5), we have neglected taking the derivative of the $w_1(\mathbf{x}_i + \mathbf{u})$ weighting function with
respect to $\mathbf{u}$, which is usually acceptable in practice, especially if the weighting function is a
binary mask with relatively few transitions.

Baker, Gross et al. (2003) only use the $w_0(\mathbf{x})$ term, which is reasonable if the two images
have the same extent and no (independent) cutouts in the overlap region. They also discuss
the idea of making the weighting proportional to $\nabla I(\mathbf{x})$, which helps for very noisy images,
where the gradient itself is noisy. Similar observations, formulated in terms of total least 
squares (Van Huffel and Vandewalle 1991; Van Huffel and Lemmerling 2002), have been 
made by other researchers studying optical flow (Weber and Malik 1995; Bab-Hadiashar and 
Suter 1998b; Mühlich and Mester 1998). Baker, Gross et al. (2003) show how evaluating
Equation (9.40) at just the most reliable (highest gradient) pixels does not significantly reduce 
performance for large enough images, even if only 5-10% of the pixels are used. (This 
idea was originally proposed by Dellaert and Collins (1999), who used a more sophisticated 
selection criterion.) 

The Lucas–Kanade incremental refinement step can also be applied to the robust error
metric introduced in Section 9.1,

$$ E_{\mathrm{LK-SRD}}(\mathbf{u} + \Delta\mathbf{u}) = \sum_i \rho(\mathbf{J}_1(\mathbf{x}_i + \mathbf{u}) \Delta\mathbf{u} + e_i), \tag{9.41} $$

which can be solved using the iteratively reweighted least squares technique described in
Section 8.1.4.
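A minimal sketch of iteratively reweighted least squares applied to the linearized robust cost (9.41), assuming the per-pixel Jacobians have been stacked into an array `J` and the residuals into `e`. The Geman–McClure weight $w = 1/(1 + r^2/a^2)^2$ is used here as one possible choice; the function name and iteration count are assumptions of this example.

```python
import numpy as np

def irls(J, e, a, iters=20):
    """Minimize sum_i rho(J_i du + e_i) by repeatedly solving a weighted normal system."""
    du = np.zeros(J.shape[1])
    for _ in range(iters):
        r = J @ du + e                               # current linearized residuals
        w = 1.0 / (1.0 + (r / a) ** 2) ** 2          # Geman-McClure IRLS weights
        A = J.T @ (w[:, None] * J)
        b = -J.T @ (w * e)
        du = np.linalg.solve(A, b)
    return du
```

Pixels with large residuals receive weights close to zero, so independently moving objects or occlusions have little influence on the final estimate.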


9.2 Parametric motion 


Many image alignment tasks, for example image stitching with handheld cameras, require the 
use of more sophisticated motion models, as described in Section 2.1.1. As these models, e.g., 
affine deformations, typically have more parameters than pure translation, a full search over 
the possible range of values is impractical. Instead, the incremental Lucas–Kanade algorithm
can be generalized to parametric motion models and used in conjunction with a hierarchical
search algorithm (Lucas and Kanade 1981; Rehg and Witkin 1991; Fuh and Maragos 1991;
Bergen, Anandan et al. 1992; Shashua and Toelg 1997; Shashua and Wexler 2001; Baker and
Matthews 2004).

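For parametric motion, instead of using a single constant translation vector $\mathbf{u}$, we use
a spatially varying motion field or correspondence map, $\mathbf{x}'(\mathbf{x}; \mathbf{p})$, parameterized by a
low-dimensional vector $\mathbf{p}$, where $\mathbf{x}'$ can be any of the motion models presented in Section 2.1.1.
The parametric incremental motion update rule now becomes

$$ E_{\mathrm{LK-PM}}(\mathbf{p} + \Delta\mathbf{p}) = \sum_i [I_1(\mathbf{x}'(\mathbf{x}_i; \mathbf{p} + \Delta\mathbf{p})) - I_0(\mathbf{x}_i)]^2 \tag{9.42} $$

$$ \approx \sum_i [I_1(\mathbf{x}'_i) + \mathbf{J}_1(\mathbf{x}'_i) \Delta\mathbf{p} - I_0(\mathbf{x}_i)]^2 \tag{9.43} $$

$$ = \sum_i [\mathbf{J}_1(\mathbf{x}'_i) \Delta\mathbf{p} + e_i]^2, \tag{9.44} $$

where the Jacobian is now

$$ \mathbf{J}_1(\mathbf{x}'_i) = \frac{\partial I_1}{\partial \mathbf{p}} = \nabla I_1(\mathbf{x}'_i) \frac{\partial \mathbf{x}'}{\partial \mathbf{p}}(\mathbf{x}_i), \tag{9.45} $$

i.e., the product of the image gradient $\nabla I_1$ with the Jacobian of the correspondence field,
$\mathbf{J}_{\mathbf{x}'} = \partial \mathbf{x}' / \partial \mathbf{p}$.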

The motion Jacobians $\mathbf{J}_{\mathbf{x}'}$ for the 2D planar transformations introduced in Section 2.1.1
and Table 2.1 are given in Table 8.1. Note how we have re-parameterized the motion matrices 
so that they are always the identity at the origin p = 0. This becomes useful later, when we 
talk about the compositional and inverse compositional algorithms. (It also makes it easier to 
impose priors on the motions.) 

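For parametric motion, the (Gauss–Newton) Hessian and gradient-weighted residual vector
become

$$ \mathbf{A} = \sum_i \mathbf{J}_{\mathbf{x}'}^T(\mathbf{x}_i) [\nabla I_1^T(\mathbf{x}'_i) \nabla I_1(\mathbf{x}'_i)] \mathbf{J}_{\mathbf{x}'}(\mathbf{x}_i) \tag{9.46} $$

and

$$ \mathbf{b} = -\sum_i \mathbf{J}_{\mathbf{x}'}^T(\mathbf{x}_i) [e_i \nabla I_1^T(\mathbf{x}'_i)]. \tag{9.47} $$

Note how the expressions inside the square brackets are the same ones evaluated for the
simpler translational motion case (9.33–9.34).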
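The accumulation in (9.46) and (9.47) can be sketched for an affine model as follows. The parameter ordering $(t_x, t_y, a_{00}, a_{01}, a_{10}, a_{11})$ and the identity-at-zero parameterization $x' = x + t_x + a_{00} x + a_{01} y$, $y' = y + t_y + a_{10} x + a_{11} y$ are assumptions of this example; the inputs are per-pixel arrays that a caller would obtain by warping $I_1$ and its gradient.

```python
import numpy as np

def affine_jacobian(x, y):
    """Motion Jacobian dx'/dp at each pixel for the affine model; shape (N, 2, 6)."""
    z, o = np.zeros_like(x), np.ones_like(x)
    return np.stack([np.stack([o, z, x, y, z, z], -1),
                     np.stack([z, o, z, z, x, y], -1)], -2)

def accumulate(grad, e, Jx):
    """Gauss-Newton A and b; grad is (N, 2), e is (N,), Jx is (N, 2, n)."""
    J = np.einsum('ni,nij->nj', grad, Jx)    # per-pixel row vector grad(I1) * Jx
    return J.T @ J, -J.T @ e
```

Solving `np.linalg.solve(A, b)` then gives the parameter update $\Delta\mathbf{p}$; the cost of forming `A` grows with the square of the number of parameters, which motivates the patch-based approximation.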


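Patch-based approximation. The computation of the Hessian and residual vectors for
parametric motion can be significantly more expensive than for the translational case. For
parametric motion with $n$ parameters and $N$ pixels, the accumulation of $\mathbf{A}$ and $\mathbf{b}$ takes
$O(n^2 N)$ operations (Baker and Matthews 2004). One way to reduce this by a significant
amount is to divide the image up into smaller sub-blocks (patches) $P_j$ and to only accumulate
the simpler 2 × 2 quantities inside the square brackets at the pixel level (Shum and Szeliski
2000),

$$ \mathbf{A}_j = \sum_{i \in P_j} \nabla I_1^T(\mathbf{x}'_i) \nabla I_1(\mathbf{x}'_i) \tag{9.48} $$

$$ \mathbf{b}_j = \sum_{i \in P_j} e_i \nabla I_1^T(\mathbf{x}'_i). \tag{9.49} $$

The full Hessian and residual can then be approximated as

$$ \mathbf{A} \approx \sum_j \mathbf{J}_{\mathbf{x}'}^T(\hat{\mathbf{x}}_j) \left[ \sum_{i \in P_j} \nabla I_1^T(\mathbf{x}'_i) \nabla I_1(\mathbf{x}'_i) \right] \mathbf{J}_{\mathbf{x}'}(\hat{\mathbf{x}}_j) = \sum_j \mathbf{J}_{\mathbf{x}'}^T(\hat{\mathbf{x}}_j) \mathbf{A}_j \mathbf{J}_{\mathbf{x}'}(\hat{\mathbf{x}}_j) \tag{9.50} $$

and

$$ \mathbf{b} \approx -\sum_j \mathbf{J}_{\mathbf{x}'}^T(\hat{\mathbf{x}}_j) \left[ \sum_{i \in P_j} e_i \nabla I_1^T(\mathbf{x}'_i) \right] = -\sum_j \mathbf{J}_{\mathbf{x}'}^T(\hat{\mathbf{x}}_j) \mathbf{b}_j, \tag{9.51} $$

where $\hat{\mathbf{x}}_j$ is the center of each patch $P_j$ (Shum and Szeliski 2000). This is equivalent to
replacing the true motion Jacobian with a piecewise-constant approximation. In practice, this
works quite well.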


Compositional approach. For a complex parametric motion such as a homography, the
computation of the motion Jacobian becomes complicated and may involve a per-pixel division.
Szeliski and Shum (1997) observed that this can be simplified by first warping the target
image $I_1$ according to the current motion estimate $\mathbf{x}'(\mathbf{x}; \mathbf{p})$,

$$ \tilde{I}_1(\mathbf{x}) = I_1(\mathbf{x}'(\mathbf{x}; \mathbf{p})), \tag{9.52} $$

and then comparing this warped image against the template $I_0(\mathbf{x})$. Subsequently Hager
and Belhumeur (1998) suggested replacing the gradient of $\tilde{I}_1(\mathbf{x})$ with the gradient of $I_0(\mathbf{x})$,
as described previously in (9.36), which allows the precomputation (and inversion) of the
Hessian matrix $\mathbf{A}$ given in (9.46). The residual vector $\mathbf{b}$ (9.47) can also be partially
precomputed, i.e., the steepest descent images $\nabla I_0(\mathbf{x}) \mathbf{J}_{\mathbf{x}'}(\mathbf{x})$ can be precomputed and stored for
later multiplication with the $e(\mathbf{x}) = \tilde{I}_1(\mathbf{x}) - I_0(\mathbf{x})$ error images, as described in (Szeliski
2010, Section 8.2) and (Baker and Matthews 2004), where this is called the inverse additive
scheme. Baker and Matthews (2004) also introduce one more variant they call the inverse
compositional algorithm where they warp the template image $I_0(\mathbf{x})$ and precompute the
inverse Hessian and the steepest descent images, which makes it the preferred approach. They
also discuss the advantage of using Gauss–Newton iteration (i.e., the first-order expansion
of the least squares, as above) compared to other approaches such as steepest descent and
Levenberg–Marquardt.

Subsequent parts of the series (Baker, Gross et al. 2003; Baker, Gross, and Matthews
2003, 2004) discuss more advanced topics such as per-pixel weighting, pixel selection for
efficiency, a more in-depth discussion of robust metrics and algorithms, linear appearance
variations, and priors on parameters. They make for invaluable reading for anyone interested
in building a highly tuned implementation of incremental image registration and have
been widely used as components of subsequent object trackers, which are discussed in
Section 9.4.4. Evangelidis and Psarakis (2008) provide some detailed experimental evaluations
of these and other related approaches.


Learned motion models 


An alternative to parameterizing the motion field with a geometric deformation such as an
affine transform is to learn a set of basis functions tailored to a particular application (Black,
Yacoob et al. 1997). First, a set of dense motion fields (Section 9.3) is computed from a set of
training videos. Next, singular value decomposition (SVD) is applied to the stack of motion
fields $\mathbf{u}_t(\mathbf{x})$ to compute the first few singular vectors $\mathbf{v}_k(\mathbf{x})$. Finally, for a new test sequence,
a novel flow field is computed using a coarse-to-fine algorithm that estimates the unknown
coefficients $a_k$ in the parameterized flow field

$$ \mathbf{u}(\mathbf{x}) = \sum_k a_k \mathbf{v}_k(\mathbf{x}). \tag{9.53} $$


Figure 9.3a shows a set of basis fields learned by observing videos of walking motions. 
Figure 9.3b shows the temporal evolution of the basis coefficients as well as a few of the 
recovered parametric motion fields. Note that similar ideas can also be applied to feature 
tracks (Torresani, Hertzmann, and Bregler 2008), which is a topic we discuss in more detail 
in Sections 7.1.5 and 13.6.4, as well as video stabilization (Yu and Ramamoorthi 2020). 
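The basis-learning step can be sketched with a singular value decomposition of the stacked training flows. Note the simplification: in the approach described above the coefficients $a_k$ are estimated directly from the images with a coarse-to-fine algorithm, whereas `fit_coefficients` below merely projects an already known flow field onto the basis to illustrate (9.53). Both function names and the array layout are assumptions of this example.

```python
import numpy as np

def learn_flow_basis(flows, k):
    """flows: (T, H, W, 2) training motion fields; returns k basis fields (k, H, W, 2)."""
    T = flows.shape[0]
    M = flows.reshape(T, -1)                       # one flattened flow field per row
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:k].reshape((k,) + flows.shape[1:])  # leading right singular vectors

def fit_coefficients(flow, basis):
    """Project a flow field onto the orthonormal basis to obtain the a_k of (9.53)."""
    B = basis.reshape(basis.shape[0], -1)
    return B @ flow.ravel()
```

A flow field is then reconstructed as `np.tensordot(a, basis, axes=1)`, which is exact whenever the field lies in the span of the retained basis vectors.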


9.2.1 Application: Video stabilization 


Video stabilization is one of the most widely used applications of parametric motion estima- 
tion (Hansen, Anandan et al. 1994; Irani, Rousso, and Peleg 1997; Morimoto and Chellappa 
1997; Srinivasan, Chellappa et al. 2005; Grundmann, Kwatra, and Essa 2011). Algorithms 
for stabilization run inside both hardware devices, such as camcorders and still cameras, and 
software packages for improving the visual quality of shaky videos. 

Figure 9.3 Learned parameterized motion fields for a walking sequence (Black, Yacoob et 
al. 1997) © 1997 IEEE: (a) learned basis flow fields; (b) plots of motion coefficients over time 
and corresponding estimated motion fields. 


In their paper on full-frame video stabilization, Matsushita, Ofek et al. (2006) give a 
nice overview of the three major stages of stabilization, namely motion estimation, motion 
smoothing, and image warping. Motion estimation algorithms often use a similarity trans- 
form to handle camera translations, rotations, and zooming. The tricky part is getting these 
algorithms to lock onto the background motion, which is a result of the camera movement, 
without getting distracted by independently moving foreground objects (Yu and Ramamoorthi 
2018, 2020; Yu, Ramamoorthi et al. 2021). Motion smoothing algorithms recover the low- 
frequency (slowly varying) part of the motion and then estimate the high-frequency shake 
component that needs to be removed. While quadratic penalties on motion derivatives are 
commonly used, more realistic virtual camera motions (locked and linear) can be obtained 
using $L_1$ minimization of derivatives (Grundmann, Kwatra, and Essa 2011). Finally, image 
warping algorithms apply the high-frequency correction to render the original frames as if the 
camera had undergone only the smooth motion. 
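
The motion smoothing stage can be sketched in a few lines. The example below assumes that
motion estimation has already produced a cumulative parameter vector for each frame (for
example translation, rotation angle, and log-scale of a similarity transform). It uses a
Gaussian low-pass filter as a simple stand-in for the quadratic-penalty smoothing mentioned
above; the function names, the parameter layout, and the value of `sigma` are all assumptions.

```python
import numpy as np

def smooth_camera_path(params, sigma=5.0):
    """params: (T, P) cumulative per-frame motion parameters -> low-frequency path (T, P)."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    # Replicate the end values so the path is not pulled toward zero at the borders.
    padded = np.pad(params, ((radius, radius), (0, 0)), mode="edge")
    columns = [np.convolve(padded[:, p], kernel, mode="valid")
               for p in range(params.shape[1])]
    return np.stack(columns, axis=1)

def stabilizing_correction(params, sigma=5.0):
    """High-frequency correction (smooth minus original) handed to the warping stage."""
    return smooth_camera_path(params, sigma) - params
```

A steady pan (a linear ramp in the parameters) passes through the filter unchanged away from
the ends of the sequence, so only the shake component ends up in the correction.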


The resulting stabilization algorithms can greatly improve the appearance of shaky videos, 
but the results often still contain visual artifacts. For example, image warping can result in missing 
borders around the image, which must be cropped, filled using information from other frames, 
or hallucinated using inpainting techniques (Section 10.5.1). Furthermore, video frames cap- 
tured during fast motion are often blurry. Their appearance can be improved either by using 
deblurring techniques (Section 10.3) or by stealing sharper pixels from other frames with less 
motion or better focus (Matsushita, Ofek et al. 2006). Exercise 9.3 has you implement and 
test some of these ideas. 


In situations where the camera is translating a lot in 3D, e.g., when the videographer is 
walking, an even better approach is to compute a full structure from motion reconstruction of 
the camera motion and 3D scene. One or more smooth 3D camera paths can then be computed 
and the original video re-rendered using view interpolation with the interpolated 3D point 
cloud serving as the proxy geometry while preserving salient features in what is sometimes 
called content preserving warps (Liu, Gleicher et al. 2009, 2011; Liu, Yuan et al. 2013; Kopf, 
Cohen, and Szeliski 2014). If you have access to a camera array instead of a single video 
camera, you can do even better using a light field rendering approach (Section 14.3) (Smith, 
Zhang et al. 2009). 


9.2.2 Spline-based motion 


While parametric motion models are useful in a wide variety of applications (such as video 
stabilization and mapping onto planar surfaces), most image motion is too complicated to be 
captured by such low-dimensional models. 

Traditionally, optical flow algorithms (Section 9.3) compute an independent motion esti- 
mate for each pixel, i.e., the number of flow vectors computed is equal to the number of input 
pixels. The general optical flow analog to Equation (9.1) can thus be written as 


$$E_{\text{SSD-OF}}(\{\mathbf{u}_i\}) = \sum_i [I_1(\mathbf{x}_i + \mathbf{u}_i) - I_0(\mathbf{x}_i)]^2. \tag{9.54}$$

Notice how in the above equation, the number of variables $\{\mathbf{u}_i\}$ is twice the number of 
measurements, so the problem is underconstrained. 

The two classic approaches to this problem, which we study in Section 9.3, are to per- 
form the summation over overlapping regions (the patch-based or window-based approach) 
or to add smoothness terms on the $\{\mathbf{u}_i\}$ field using regularization or Markov random fields 
(Chapter 4). In this section, we describe an alternative approach that lies somewhere between 
general optical flow (independent flow at each pixel) and parametric flow (a small number of 
global parameters). The approach is to represent the motion field as a two-dimensional spline 
controlled by a smaller number of control vertices $\{\hat{\mathbf{u}}_j\}$ (Figure 9.4), 

$$\mathbf{u}_i = \sum_j \hat{\mathbf{u}}_j B_j(\mathbf{x}_i) = \sum_j \hat{\mathbf{u}}_j w_{ij}, \tag{9.55}$$

where the $B_j(\mathbf{x}_i)$ are called the basis functions and are only non-zero over a small finite 
support interval (Szeliski and Coughlan 1997). We call the $w_{ij} = B_j(\mathbf{x}_i)$ weights to emphasize 
that the $\{\mathbf{u}_i\}$ are known linear combinations of the $\{\hat{\mathbf{u}}_j\}$. 
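
To make the role of the weights concrete, here is a minimal sketch that builds the weight
matrix for bilinear basis functions and evaluates the per-pixel flow from a control mesh.
The choice of bilinear basis functions, the regular vertex spacing `step`, and the function
names are assumptions of this example. The matrix is stored densely for clarity, although
each row has at most four non-zero entries and a real implementation would keep it sparse.

```python
import numpy as np

def hat_weights(num_pixels, step):
    """1D linear (hat) basis weights, shape (num_pixels, num_vertices)."""
    num_vertices = (num_pixels - 1) // step + 1
    p = np.arange(num_pixels)[:, None]
    k = np.arange(num_vertices)[None, :]
    return np.clip(1.0 - np.abs(p - k * step) / step, 0.0, None)

def spline_weight_matrix(height, width, step):
    """W = [w_ij] with w_ij = B_j(x_i); pixels and vertices are numbered in row-major order."""
    return np.kron(hat_weights(height, step), hat_weights(width, step))

def spline_flow(control, height, width, step):
    """control: (ny, nx, 2) control displacements -> (height, width, 2) per-pixel flow."""
    weights = spline_weight_matrix(height, width, step)
    return (weights @ control.reshape(-1, 2)).reshape(height, width, 2)
```

The sketch assumes that `height - 1` and `width - 1` are multiples of `step`, so that the
control vertices cover the whole image and the weights at each pixel sum to one.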

Substituting the formula for the individual per-pixel flow vectors $\mathbf{u}_i$ (9.55) into the SSD 
error metric (9.54) yields a parametric motion formula similar to Equation (9.43). The biggest 
difference is that the Jacobian $\mathbf{J}_1(\mathbf{x}'_i)$ (9.45) now consists of the sparse entries in the weight 
matrix $\mathbf{W} = [w_{ij}]$. 

Figure 9.4 Spline motion field: the displacement vectors $\mathbf{u}_i = (u_i, v_i)$ are shown as pluses 
(+) and are controlled by the smaller number of control vertices $\hat{\mathbf{u}}_j = (\hat{u}_j, \hat{v}_j)$, which are 
shown as circles (○). 

In situations where we know something more about the motion field, e.g., when the mo- 
tion is due to a camera moving in a static scene, we can use more specialized motion models. 
For example, the plane plus parallax model (Section 2.1.4) can be naturally combined with 
a spline-based motion representation, where the in-plane motion is represented by a homog- 
raphy (8.19) and the out-of-plane parallax d is represented by a scalar variable at each spline 
control point (Szeliski and Kang 1995; Szeliski and Coughlan 1997). 

In many cases, the small number of spline vertices results in a motion estimation problem 
that is well conditioned. However, if large textureless regions (or elongated edges subject 
to the aperture problem) persist across several spline patches, it may be necessary to add a 
regularization term to make the problem well posed (Section 4.2). The simplest way to do 
this is to directly add squared difference penalties between adjacent vertices in the spline 
control mesh $\{\hat{\mathbf{u}}_j\}$, as in (4.24). If a multi-resolution (coarse-to-fine) strategy is being used, 
it is important to re-scale these smoothness terms while going from level to level. 


The linear system corresponding to the spline-based motion estimator is sparse and reg- 
ular. Because it is usually of moderate size, it can often be solved using direct techniques 
such as Cholesky decomposition (Appendix A.4). Alternatively, if the problem becomes 
too large and subject to excessive fill-in, iterative techniques such as hierarchically precon- 
ditioned conjugate gradient (Szeliski 1990b, 2006b; Krishnan and Szeliski 2011; Krishnan, 
Fattal, and Szeliski 2013) can be used instead (Appendix A.5). 
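
The following sketch puts the last two ideas together for one scalar component of the control
mesh: it builds the squared-difference smoothness matrix for adjacent vertices, adds it to a
data Hessian, and solves the resulting system with a Cholesky factorization. The names
`mesh_smoothness_matrix`, `solve_regularized`, and the weight `lam` are assumptions of this
example, and dense matrices are used only to keep the code short.

```python
import numpy as np

def mesh_smoothness_matrix(ny, nx):
    """S with u^T S u = sum of squared differences between 4-adjacent mesh vertices."""
    n = ny * nx
    S = np.zeros((n, n))
    idx = np.arange(n).reshape(ny, nx)
    pairs = list(zip(idx[:, :-1].ravel(), idx[:, 1:].ravel()))   # horizontal neighbors
    pairs += list(zip(idx[:-1, :].ravel(), idx[1:, :].ravel()))  # vertical neighbors
    for a, b in pairs:
        S[a, a] += 1.0
        S[b, b] += 1.0
        S[a, b] -= 1.0
        S[b, a] -= 1.0
    return S

def solve_regularized(A, b, S, lam):
    """Solve (A + lam * S) u = b using the Cholesky factorization A + lam * S = L L^T."""
    L = np.linalg.cholesky(A + lam * S)
    y = np.linalg.solve(L, b)        # forward substitution (generic solver used for brevity)
    return np.linalg.solve(L.T, y)   # back substitution
```

When the data Hessian constrains only a few vertices, as happens in textureless regions, it
is singular on its own, but adding the smoothness term makes the system positive definite as
long as at least one vertex in the connected mesh is constrained.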


Because of its robustness, spline-based motion estimation has been used for a number 
of applications, including visual effects (Roble 1999) and medical image registration 
(Section 9.2.3) (Szeliski and Lavallée 1996; Kybic and Unser 2003). 

Figure 9.5 Quadtree spline-based motion estimation (Szeliski and Shum 1996) © 1996 
IEEE: (a) quadtree spline representation, (b) which can lead to cracks, unless the white nodes 
are constrained to depend on their parents; (c) deformed quadtree spline mesh overlaid on 
grayscale image; (d) flow field visualized as a needle diagram. 

One disadvantage of the basic technique, however, is that the model does a poor job 
near motion discontinuities, unless an excessive number of nodes are used. To remedy this 
situation, Szeliski and Shum (1996) propose using a quadtree representation embedded in the 
spline control grid (Figure 9.5a). Large cells are used to represent regions of smooth motion, 
while smaller cells are added in regions of motion discontinuities (Figure 9.5c). 

To estimate the motion, a coarse-to-fine strategy is used. Starting with a regular spline 
imposed over a lower-resolution image, an initial motion estimate is obtained. Spline patches 
where the motion is inconsistent, i.e., the squared residual (9.54) is above a threshold, are 
subdivided into smaller patches. To avoid cracks in the resulting motion field (Figure 9.5b), 
the values of certain nodes in the refined mesh, i.e., those adjacent to larger cells, need to be 
restricted so that they depend on their parent values. This is most easily accomplished using 
a hierarchical basis representation for the quadtree spline (Szeliski 1990b) and selectively 
setting some of the hierarchical basis functions to 0, as described in (Szeliski and Shum 
1996). 
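
The subdivision rule can be sketched as a simple recursive procedure. The example below is a
simplification under stated assumptions: it splits cells top-down using a fixed squared
residual image, whereas the method described above re-estimates the motion after each
refinement, and it does not implement the hierarchical basis constraints that prevent cracks.
The function name, the use of the mean residual per cell, and the parameters `threshold` and
`min_size` are hypothetical.

```python
import numpy as np

def quadtree_cells(residual_sq, threshold, min_size):
    """Return (y, x, size) square cells covering a 2^k x 2^k squared-residual image."""
    cells = []

    def split(y, x, size):
        patch = residual_sq[y:y + size, x:x + size]
        if size > min_size and patch.mean() > threshold:
            half = size // 2
            # Motion is inconsistent here: replace the cell with its four children.
            for dy in (0, half):
                for dx in (0, half):
                    split(y + dy, x + dx, half)
        else:
            cells.append((y, x, size))

    split(0, 0, residual_sq.shape[0])
    return cells
```

Large cells survive wherever the residual is low, and small cells accumulate around the
regions where a single smooth patch cannot explain the motion.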


9.2.3 Application: Medical image registration 


Because they excel at representing smooth elastic deformation fields, spline-based motion 
models have found widespread use in medical image registration (Bajcsy and Kovacic 1989; 
Szeliski and Lavallée 1996; Christensen, Joshi, and Miller 1997).¹⁰ Registration techniques 
can be used both to track an individual patient’s development or progress over time (a 
longitudinal study) and to match different patient images together to find commonalities and 
detect variations or pathologies (cross-sectional studies). When different imaging modalities 
are being registered, e.g., computed tomography (CT) scans and magnetic resonance images 
(MRI), mutual information measures of similarity are often necessary (Viola and Wells III 
1997; Maes, Collignon et al. 1997). 

10 In computer graphics, such elastic volumetric deformations are known as free-form deformations (Sederberg 
and Parry 1986; Coquillart 1990; Celniker and Gossard 1991). 

Figure 9.6 Elastic brain registration (Kybic and Unser 2003) © 2003 IEEE: (a) original 
brain atlas and patient MRI images overlaid in red-green; (b) after elastic registration with 
eight user-specified landmarks (not shown); (c) a cubic B-spline deformation field, shown as 
a deformed grid. 
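
To see why mutual information suits multi-modality registration, the sketch below estimates
it from a joint intensity histogram of two aligned images. This is a basic histogram estimator
written for illustration, not the specific estimators used in the papers cited above, and the
function name and the number of bins are assumptions.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Histogram estimate of the mutual information (in nats) between two aligned images."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of the first image
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of the second image
    nz = p_ab > 0                           # skip empty bins to avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))
```

The measure is high whenever one image's intensities predict the other's, even through a
non-linear or inverted mapping, which is why it can be used where brightness differences
between corresponding pixels are meaningless.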

Kybic and Unser (2003) provide a nice literature review and describe a complete working 
system based on representing both the images and the deformation fields as multi-resolution 
splines. Figure 9.6 shows an example of the Kybic and Unser system being used to register a 
patient’s brain MRI with a labeled brain atlas image. The system can be run in a fully auto- 
matic mode but more accurate results can be obtained by locating a few key landmarks. More 
recent papers on deformable medical image registration, including performance evaluations, 
include Klein, Staring, and Pluim (2007), Glocker, Komodakis et al. (2008), and the survey 
by Sotiras, Davatzikos, and Paragios (2013). 

As with other applications, regular volumetric splines can be enhanced using selective 
refinement. In the case of 3D volumetric image or surface registration, these are known as 
octree splines (Szeliski and Lavallée 1996) and have been used to register medical surface 
models such as vertebrae and faces from different patients (Figure 9.7). 

Figure 9.7 Octree spline-based image registration of two vertebral surface models (Szeliski 
and Lavallée 1996) © 1996 Springer: (a) after initial rigid alignment; (b) after elastic 
alignment; (c) a cross-section through the adapted octree spline deformation field. 


9.3 Optical flow 


The most general (and challenging) version of motion estimation is to compute an independent 
estimate of motion at each pixel, which is generally known as optical (or optic) flow. As 
we mentioned in the previous section, this generally involves minimizing the brightness or 
color difference between corresponding pixels summed over the image, 

$$E_{\text{SSD-OF}}(\{\mathbf{u}_i\}) = \sum_i [I_1(\mathbf{x}_i + \mathbf{u}_i) - I_0(\mathbf{x}_i)]^2. \tag{9.56}$$

Because the number of variables $\{\mathbf{u}_i\}$ is twice the number of measurements, the problem 
is underconstrained. The two classic approaches to this problem are to perform the summation 
locally over overlapping regions (the patch-based or window-based approach) or to add 
smoothness terms on the $\{\mathbf{u}_i\}$ field using regularization or Markov random fields (Chapter 4) 
and to search for a global minimum. Good overviews of recent optical flow algorithms can be 
found in Baker, Scharstein et al. (2011), Sun, Yang et al. (2018), Janai, Güney et al. (2020), 
and Hur and Roth (2020). 

The patch-based approach usually involves using a Taylor series expansion of the displaced 
image function (9.28) to obtain sub-pixel estimates (Lucas and Kanade 1981). Anandan (1989) 
shows how a series of local discrete search steps can be interleaved with Lucas–Kanade 
incremental refinement steps in a coarse-to-fine pyramid scheme, which allows the 
estimation of large motions, as described in Section 9.1.1. He also analyzes how the 
uncertainty in local motion estimates is related to the eigenvalues of the local Hessian matrix $\mathbf{A}_i$ 
(9.37), as shown in Figures 7.4 and 7.5. 

Bergen, Anandan et al. (1992) develop a unified framework for describing both parametric 
(Section 9.2) and patch-based optical flow algorithms and provide a nice introduction to this 
topic. After each iteration of optical flow estimation in a coarse-to-fine pyramid, they 
re-warp one of the images so that only incremental flow estimates are computed (Section 9.1.1). 
When overlapping patches are used, an efficient implementation is to first compute the outer 
products of the gradients and intensity errors (9.33–9.34) at every pixel and then perform the 
overlapping window sums using a moving average filter.¹¹ 
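
The moving-average implementation can be sketched as follows for a single linearized step at
one pyramid level. The sketch assumes small motions and grayscale floating-point images, omits
the coarse-to-fine pyramid and re-warping, and uses hypothetical function names; the small
constant `eps` that guards against a singular Hessian is also an assumption.

```python
import numpy as np

def box_filter(img, radius):
    """Moving-average filter with edge replication, computed from a summed-area table."""
    k = 2 * radius + 1
    p = np.pad(img, radius, mode="edge")
    c = np.pad(np.cumsum(np.cumsum(p, axis=0), axis=1), ((1, 0), (1, 0)))
    return (c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]) / (k * k)

def lucas_kanade_step(i0, i1, radius=3, eps=1e-6):
    """One linearized per-pixel flow estimate u such that I1(x + u) ~= I0(x)."""
    gy, gx = np.gradient(i1)   # image gradient of I1: J = (I_x, I_y)
    e = i1 - i0                # intensity error e_i at zero displacement
    # Per-pixel outer products J J^T and -e J, aggregated over overlapping windows.
    axx, axy, ayy = (box_filter(t, radius) for t in (gx * gx, gx * gy, gy * gy))
    bx, by = box_filter(-gx * e, radius), box_filter(-gy * e, radius)
    # Closed-form solution of the 2 x 2 system A u = b at every pixel.
    det = axx * ayy - axy * axy
    u = (ayy * bx - axy * by) / (det + eps)
    v = (axx * by - axy * bx) / (det + eps)
    return np.stack([u, v], axis=-1)
```

Because the window sums are computed once for the whole image, the cost per pixel does not
grow with the patch size.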

Instead of solving for each motion (or motion update) independently, Horn and Schunck 
(1981) develop a regularization-based framework where (9.56) is simultaneously minimized 
over all flow vectors {u;}. To constrain the problem, smoothness constraints, i.e., squared 
penalties on flow derivatives, are added to the basic per-pixel error metric. Because the 
technique was originally developed for small motions in a variational (continuous function) 
framework, the linearized brightness constancy constraint corresponding to (9.28), i.e., 
(9.31), is more commonly written as an analytic integral 

$$E_{\text{HS}} = \int (I_x u + I_y v + I_t)^2 \, dx \, dy, \tag{9.57}$$

where $(I_x, I_y) = \nabla I_1 = \mathbf{J}_1$, $I_t = e_i$ is the temporal derivative, i.e., the brightness change 
between images, and $u(x, y)$ and $v(x, y)$ are the 2D optical flow functions. The Horn and 
Schunck model can also be viewed as the limiting case of spline-based motion estimation as 
the splines become 1 × 1 pixel patches. 
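
A minimal sketch of the classic iterative scheme for minimizing (9.57) plus a quadratic
smoothness penalty is given below. It assumes small motions and grayscale floating-point
images, uses a four-neighbor average for the local mean flow, and folds the discretization
constants into the smoothness weight `alpha`; these choices and the function names are
assumptions of this example.

```python
import numpy as np

def neighbor_mean(f):
    """Average of the four nearest neighbors, with edge replication at the borders."""
    p = np.pad(f, 1, mode="edge")
    return 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:])

def horn_schunck(i0, i1, alpha=1.0, num_iters=200):
    """Dense flow (u, v) from the linearized constraint I_x u + I_y v + I_t = 0."""
    iy, ix = np.gradient(0.5 * (i0 + i1))   # spatial derivatives I_y, I_x
    it = i1 - i0                            # temporal derivative I_t
    u = np.zeros_like(i0)
    v = np.zeros_like(i0)
    for _ in range(num_iters):
        u_bar, v_bar = neighbor_mean(u), neighbor_mean(v)
        # Scaled residual of the brightness constancy constraint at the averaged flow.
        r = (ix * u_bar + iy * v_bar + it) / (alpha ** 2 + ix ** 2 + iy ** 2)
        u, v = u_bar - ix * r, v_bar - iy * r
    return np.stack([u, v], axis=-1)
```

Each iteration pulls the flow toward the local average and then corrects it along the image
gradient, so information propagates from textured regions into areas where the brightness
constraint alone is ambiguous.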

It is also possible to combine ideas from local and global flow estimation into a single 
framework by using a locally aggregated (as opposed to single-pixel) Hessian as the brightness 
constancy term (Bruhn, Weickert, and Schnörr 2005). Consider the discrete analog 
(9.28) to the analytic global energy (9.57), 

$$E_{\text{HSD}} = \sum_i \mathbf{u}_i^T [\mathbf{J}_i \mathbf{J}_i^T] \mathbf{u}_i + 2 e_i \mathbf{J}_i^T \mathbf{u}_i + e_i^2. \tag{9.58}$$

If we replace the per-pixel (rank 1) Hessians $\mathbf{A}_i = [\mathbf{J}_i \mathbf{J}_i^T]$ and residuals $\mathbf{b}_i = \mathbf{J}_i e_i$ with 
area-aggregated versions (9.33–9.34), we obtain a global minimization algorithm where 
region-based brightness constraints are used. 

Another extension to the basic optical flow model is to use a combination of global (para- 
metric) and local motion models. For example, if we know that the motion is due to a camera 
moving in a static scene (rigid motion), we can re-formulate the problem as the estimation of 
a per-pixel depth along with the parameters of the global camera motion (Adiv 1989; Hanna 
1991; Bergen, Anandan et al. 1992; Szeliski and Coughlan 1997; Nir, Bruckstein, and Kimmel 
2008; Wedel, Cremers et al. 2009). Such techniques are closely related to stereo matching 
(Chapter 12). Alternatively, we can estimate either per-image or per-segment affine motion 
models combined with per-pixel residual corrections (Black and Jepson 1996; Ju, Black, and 
Jepson 1996; Chang, Tekalp, and Sezan 1997; Mémin and Pérez 2002). We revisit this topic 
in Section 9.4. 

11 Other smoothing or aggregation filters can also be used at this stage (Bruhn, Weickert, and Schnörr 2005). 

Of course, image brightness may not always be an appropriate metric for measuring ap- 
pearance consistency, e.g., when the lighting in an image is varying. As discussed in Sec- 
tion 9.1, matching gradients, filtered images, or other metrics such as image Hessians (sec- 
ond derivative measures) may be more appropriate. It is also possible to locally compute the 
phase of steerable filters in the image, which is insensitive to both bias and gain transfor- 
mations (Fleet and Jepson 1990). Papenberg, Bruhn et al. (2006) review and explore such 
constraints and also provide a detailed analysis and justification for iteratively re-warping 
images during incremental flow computation. 

Because the brightness constancy constraint is evaluated at each pixel independently, 
rather than being summed over patches where the constant flow assumption may be violated, 
global optimization approaches tend to perform better near motion discontinuities. This is 
especially true if robust metrics are used in the smoothness constraint (Black and Anandan 
1996; Bab-Hadiashar and Suter 1998a).¹² One popular choice for robust metrics is the $L_1$ 
norm, also known as total variation (TV), which results in a convex energy whose global 
minimum can be found (Bruhn, Weickert, and Schnörr 2005; Papenberg, Bruhn et al. 2006; 
Zach, Pock, and Bischof 2007b; Zimmer, Bruhn, and Weickert 2011). Anisotropic smooth- 
ness priors, which apply a different smoothness in the directions parallel and perpendicular to 
the image gradient, are another popular choice (Nagel and Enkelmann 1986; Sun, Roth et al. 
2008; Werlberger, Trobin et al. 2009; Werlberger, Pock, and Bischof 2010). It is also possible 
to learn a set of better smoothness constraints (derivative filters and robust functions) from a 
set of paired flow and intensity images (Sun, Roth et al. 2008). Many of these techniques are 
discussed in more detail by Baker, Scharstein et al. (2011) and Sun, Roth, and Black (2014). 
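
A small numerical experiment illustrates why robust smoothness penalties behave better at
motion discontinuities. The sketch below compares a quadratic penalty with a Charbonnier
penalty, which is a differentiable approximation to the $L_1$ norm, on a flow component that
jumps abruptly and on one that changes by the same amount gradually. The function names and
the value of `eps` are assumptions of this example.

```python
import numpy as np

def quadratic(d):
    return d ** 2

def charbonnier(d, eps=1e-3):
    """Differentiable approximation to |d| (exactly zero at d = 0)."""
    return np.sqrt(d ** 2 + eps ** 2) - eps

def smoothness_energy(u, penalty):
    """Sum of a penalty over horizontal and vertical first differences of a flow component."""
    return float(penalty(np.diff(u, axis=1)).sum() + penalty(np.diff(u, axis=0)).sum())

# A unit step (sharp motion boundary) versus a ramp spread over eight pixels.
step = np.tile((np.arange(16) >= 8).astype(float), (4, 1))
ramp = np.tile(np.clip((np.arange(16) - 4) / 8.0, 0.0, 1.0), (4, 1))
```

Under the quadratic penalty the step costs eight times as much as the ramp, so the minimizer
prefers to blur the boundary, whereas under the $L_1$-like penalty the two cost almost the
same and a sharp discontinuity is no longer discouraged.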

Because of the large, two-dimensional search space in estimating flow, most algorithms 
use variations of gradient descent and coarse-to-fine continuation methods to minimize the 
global energy function. This contrasts starkly with stereo matching, which is an “easier” 
one-dimensional disparity estimation problem, where combinatorial optimization techniques 
were the method of choice until the advent of deep neural networks.¹³ One way to deal 
with this complexity is to start with efficient patch-based correspondences (Kroeger, Tim- 
ofte et al. 2016). Another way to deal with the large two-dimensional search space is to 
integrate sparse feature matches into a variational formulation, as was initially proposed by 
Brox and Malik (2010a). This approach was later extended by several authors, including 
Weinzaepfel, Revaud et al. (2013), whose DeepFlow system uses a hand-crafted (non-learnt) 
convolutional network to compute initial quasi-dense correspondences, and Revaud, Weinzaepfel 
et al. (2015), whose EpicFlow system added an edge- and occlusion-aware interpolation 
step before the variational optimization. 

12 Robust brightness metrics (Section 9.1, (9.2)) can also help improve the performance of window-based 
approaches (Black and Anandan 1996). 

13 Some exceptions to this trend of not exploring the full 4D cost volume can be found in Xu, Ranftl, and Koltun 
(2017) and Teed and Deng (2020b). 

[Screenshot of the Middlebury optical flow evaluation table: average endpoint error and rank 
for each algorithm on the Army, Mequon, Schefflera, Wooden, Grove, Urban, Yosemite, and Teddy 
sequences, each reported over the all, disc, and untext region masks.] 

Figure 9.8 Evaluation of the results of 24 optical flow algorithms, October 2009, 
https://vision.middlebury.edu/flow (Baker, Scharstein et al. 2009). By moving the mouse pointer 
over an underlined performance score, the user can interactively view the corresponding flow 
and error maps. Clicking on a score toggles between the computed and ground truth flows. 
Next to each score, the corresponding rank in the current column is indicated by a smaller 
blue number. The minimum (best) score in each column is shown in boldface. The table is 
sorted by the average rank (computed over all 24 columns, three region masks for each of the 
eight sequences). The average rank serves as an approximate measure of performance under 
the selected metric/statistic. 

Combinatorial optimization methods based on Markov random fields were among the 
better-performing methods on the optical flow database of Baker, Scharstein et al. (2011)¹⁴ 
when it was originally released, but have now been overtaken by deep neural networks. Ex- 
amples of such techniques include the one developed by Glocker, Paragios et al. (2008), who 
use a coarse-to-fine strategy with per-pixel 2D uncertainty estimates, which are then used to 
guide the refinement and search at the next finer level. Lempitsky, Roth, and Rother (2008) 
use fusion moves (Lempitsky, Rother, and Blake 2007) over proposals generated from basic 
flow algorithms (Horn and Schunck 1981; Lucas and Kanade 1981) to find good solutions. 

A careful empirical analysis of these kinds of “classic” coarse-to-fine energy-minimization 
approaches is provided in the meticulously executed paper by Sun, Roth, and Black (2014).¹⁵ 
Figure 9.9a shows the main components of the framework they examine, including an initial 
warping based on the previous level’s flow (or a grid search at the coarsest level), followed by 
energy minimizing flow updates, and then an optional post-processing step. In their paper, the 
authors not only review dozens of variational (energy-minimization) approaches developed 
from the 1980s (Horn and Schunck 1981) through to 2013, but also show that algorithmic 
details such as median filtering post-processing, often glossed over by previous authors, have 
a strong influence on the results. In addition to performing their analysis on the Middlebury 
Flow dataset (Baker, Scharstein et al. 2011), they also evaluate on the newer Sintel dataset 
(Butler, Wulff et al. 2012).¹⁶ 
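
As an illustration of the kind of median filtering step mentioned above, the sketch below
applies a median filter to each component of a flow field. The window size, the border
handling, and the function name are assumptions of this example; the exact configuration
studied in the paper is not reproduced here.

```python
import numpy as np

def median_filter_flow(flow, radius=2):
    """Apply a (2r+1) x (2r+1) median filter to each component of an (H, W, 2) flow field."""
    h, w, _ = flow.shape
    k = 2 * radius + 1
    p = np.pad(flow, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Gather every shifted copy of the field, then take the per-pixel median across them.
    shifted = [p[dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]
    return np.median(np.stack(shifted, axis=0), axis=0)
```

Unlike a linear smoothing filter, the median removes isolated outliers in the flow without
averaging across motion boundaries.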

The field of accurate motion estimation continues to evolve at a rapid pace, with sig- 
nificant advances in performance occurring every year. While the Middlebury optical flow 
website (Figure 9.8) continues to be a good source of pointers to high-performing algorithms, 
more recent publications tend to focus (both training and evaluation) on the MPI Sintel dataset 
developed by Butler, Wulff et al. (2012), some samples of which are shown in Figure 9.1e–f. 
Some algorithms also train and test on the KITTI flow benchmark (Geiger, Lenz, and Urtasun 
2012), although that dataset focuses on video acquired from a driving vehicle. In general, it 
appears that learning-based algorithms trained on one dataset still have trouble when applied 
to a different dataset.¹⁷ 
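
The benchmarks above rank algorithms by how far the estimated flow is from the ground truth.
Two commonly reported per-pixel measures are the endpoint error and the angular error, which
can be computed as in the following sketch. The function names and the (H, W, 2) array layout
are assumptions of this example.

```python
import numpy as np

def endpoint_error(flow, flow_gt):
    """Euclidean distance between estimated and ground truth flow vectors at each pixel."""
    return np.sqrt(((flow - flow_gt) ** 2).sum(axis=-1))

def angular_error_deg(flow, flow_gt):
    """Angle in degrees between the space-time directions (u, v, 1) and (u_gt, v_gt, 1)."""
    num = (flow * flow_gt).sum(axis=-1) + 1.0
    den = np.sqrt(((flow ** 2).sum(axis=-1) + 1.0) * ((flow_gt ** 2).sum(axis=-1) + 1.0))
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
```

Averaging either map over all pixels, or over a region mask such as the pixels near motion
discontinuities, gives a single score per sequence.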


14 https://vision.middlebury.edu/flow 

15 The earlier conference version of this paper had the eye-catching title of “Secrets of optical flow estimation and 
their principles” (Sun, Roth, and Black 2010). 

16 http://sintel.is.tue.mpg.de 

17 http://www.robustvision.net 

Figure 9.9 Iterative coarse-to-fine optical flow estimation (Sun, Yang et al. 2018) © 
2018 IEEE: (a) “classic” variational (energy minimization) approach (Sun, Roth, and Black 
2014); (b) newer neural network approach trained with end-to-end deep learning (Sun, Yang 
et al. 2018). Both figures show the processing at a single level of the coarse-to-fine pyramid, 
taking as input the flow computed by the previous (coarser) level and passing the refined flow 
onto the finer level below. 


9.3.1 Deep learning approaches 


Over the last decade, deep neural networks have become an essential component of all highly 
performant optical flow algorithms, as described in the survey articles by Janai, Güney et al. 
(2020, Chapter 11) and Hur and Roth (2020). An early approach to use non-linear aggregation 
inspired by deep convolutional networks is the DeepFlow system of Weinzaepfel, Revaud et 
al. (2013), which uses hand-crafted (non-learned) convolutions and pooling to compute 
multi-level response maps (matching costs), which are then optimized using a classic 
energy-minimizing variational framework. 

The first system to use full deep end-to-end learning in an encoder-decoder network was 
FlowNetS (Dosovitskiy, Fischer et al. 2015), which was trained on the authors’ synthetic 
FlyingChairs dataset. The paper also introduced FlowNetC, which uses a correlation network 
(local cost volume). The follow-on FlowNet 2.0 system uses the initial flow estimates to 
warp the images and then refines the flow estimates using cascaded encoder-decoder networks 
(Ilg, Mayer et al. 2017), while subsequent papers also deal with occlusions and uncertainty 
modeling (Ilg, Saikia et al. 2018; Ilg, Çiçek et al. 2018). 

An alternative to stacking full-resolution networks in series is to use image and flow 
pyramids together with coarse-to-fine warping and refinement, as first explored in the SPyNet 
paper by Ranjan and Black (2017). The more recent PWC-Net of Sun, Yang ef al. (2018, 
2019) shown in Figure 9.9b extends this idea by first computing a feature pyramid from each 
frame, warping the second set of features by the flow interpolated from the previous resolution 
level, and then computing a cost volume by correlating these features using a dot product 
between feature maps shifted by up to d = ±4 pixels. The refined optical flow estimates at 
the current level are produced using a multi-layer CNN whose inputs are the cost volume, the 
image features, and the interpolated flow from the previous level. A final context network 
takes as input the flow estimate and features from the second to last level and uses dilated 
convolutions to endow the network with a broader context. If you compare Figures 9.9a–b, 
you will see a pleasing correspondence between the various processing stages of classic and 
deep coarse-to-fine flow estimation algorithms.¹⁸ 

Figure 9.10 Iterative residual refinement optical flow estimation (Hur and Roth 2019) © 
2019 IEEE. The coarse-to-fine cascade of Sun, Yang et al. (2018) in Figure 9.9b is replaced 
with a recurrent neural network (RNN) that cycles interpolated coarser level flow estimates 
as warping inputs to the next finer level but uses the same convolutional weights at each level. 
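As a rough illustration of the cost volume step described above, the following pure-Python sketch correlates each feature vector of the first frame with the features of the (already warped) second frame at every shift of up to d pixels in each direction. The function name `local_cost_volume`, the nested-list H × W × C feature layout, the zero cost assigned to out-of-bounds shifts, and the normalization by the channel count are assumptions made for this example, not details confirmed by the text.

```python
def local_cost_volume(feat1, feat2, max_disp=4):
    """Correlate feat1 with shifted copies of feat2 (hypothetical helper).

    feat1, feat2: H x W x C nested lists; feat2 is assumed to be warped already.
    Returns (cost, shifts), where cost[y][x][k] is the mean dot product between
    feat1[y][x] and feat2 at the k-th shift (dy, dx) listed in shifts.
    """
    height, width = len(feat1), len(feat1[0])
    channels = len(feat1[0][0])
    shifts = [(dy, dx)
              for dy in range(-max_disp, max_disp + 1)
              for dx in range(-max_disp, max_disp + 1)]
    cost = []
    for y in range(height):
        row = []
        for x in range(width):
            scores = []
            for dy, dx in shifts:
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    dot = sum(a * b for a, b in zip(feat1[y][x], feat2[yy][xx]))
                    scores.append(dot / channels)
                else:
                    scores.append(0.0)  # assumption: out-of-bounds shifts cost zero
            row.append(scores)
        cost.append(row)
    return cost, shifts
```

With `max_disp=4`, each pixel receives (2 · 4 + 1)² = 81 correlation values, which a refinement network could then consume together with the image features and the upsampled flow.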

A variant on the coarse-to-fine PWC-Net developed by Hur and Roth (2019) is the It- 
erative Residual Refinement network shown in Figure 9.10. Instead of cascading a set of 
different deep networks as in FlowNet 2.0 and PWC-Net, IRRs re-use the same structure and 
convolution weights at each layer, which allows the network to be re-drawn in the “rolled up” 
version, as shown in this figure. The network can thus be thought of as a simple recurrent 
neural network (RNN) that upsamples the output flow estimates after each stage. In addition 
to having fewer parameters, this weight sharing also improves accuracy. In their paper, the 
authors also show how this network can be extended (doubled) to simultaneously compute 
forward and backward flows as well as occlusions. 

In more recent work, Jonschkowski, Stone et al. (2020) take PWC-Net as their basic archi- 
tecture and systematically study all of the components involved in training the flow estimator 
in an unsupervised manner, i.e., using regular real-world videos with no ground truth flow, 
which can enable much larger training sets to be used (Ahmadi and Patras 2016; Meister, 
Hur, and Roth 2018). In their paper, Jonschkowski et al. systematically compare photometric 
losses, occlusion estimation, self-supervision, and smoothness constraints, and analyze the 
effect of other choices, such as pre-training, image resolution, data augmentation, and batch 
size. They also propose four improvements to these key components, including cost volume 
normalization, gradient stopping for occlusion estimation, applying smoothness at the native 
flow resolution, and image resizing for self-supervision. Another recent paper that explicitly 
deals with occlusions is Jiang, Campbell et al. (2021). 

¹⁸Note that as with other coarse-to-fine warping approaches, these algorithms struggle with fast-moving fine 
structures that may not be visible at coarser levels (Brox and Malik 2010a). 


Another recent trend has been to model the uncertainty that arises in flow field estima- 
tion due to homogeneous and occluded regions (Ilg, Çiçek et al. 2018). The HD³ network 
developed by Yin, Darrell, and Yu (2019) models correspondence distributions across mul- 
tiple resolution levels, while the LiteFlowNet3 network of Hui and Loy (2020) extends their 
small and fast LiteFlowNet2 network (Hui, Tang, and Loy 2021) with cost volume modula- 
tion and flow field deformation modules to significantly improve accuracy at minimal cost. 
In concurrent work, Hofinger, Rota Bulò et al. (2020) introduce novel components such as 
replacing warping by sampling, smart gradient blocking, and knowledge distillation, which 
not only improve the quality of their flow estimates but can also be used in other applications 
such as stereo matching. Teed and Deng (2020b) build on the idea of a recurrent network 
(Hur and Roth 2019), but instead of warping feature maps, they precompute a full (W × H)² 
multi-resolution correlation volume (Recurrent All-Pairs Field Transforms or RAFT), which 
is accessed at each iteration based on the current flow estimates. Computing a sparse corre- 
lation volume storing only the k closest matches for each reference image feature can further 
accelerate the computation (Jiang, Lu et al. 2021). 
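The idea of precomputing an all-pairs correlation volume and then indexing it with the current flow can be sketched as follows. This is a simplified single-resolution illustration: the function names, the nested-list layout, the nearest-neighbor (rather than bilinear) lookup, and the zero fill outside the image are all assumptions of this example and should not be read as the RAFT implementation. The last function illustrates the sparse variant that keeps only the k strongest matches per reference pixel.

```python
def all_pairs_correlation(feat1, feat2):
    """Return corr[y1][x1][y2][x2] = <feat1[y1][x1], feat2[y2][x2]>."""
    return [[[[sum(a * b for a, b in zip(f1, f2)) for f2 in row2]
              for row2 in feat2]
             for f1 in row1]
            for row1 in feat1]


def lookup(corr, flow, radius=1):
    """Read a (2r+1)^2 window of correlations centered on x + flow(x).

    flow[y][x] = (u, v), with u horizontal and v vertical, in pixels.
    """
    height, width = len(corr), len(corr[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            cx = x + int(round(flow[y][x][0]))
            cy = y + int(round(flow[y][x][1]))
            window = []
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = cy + dy, cx + dx
                    inside = 0 <= yy < height and 0 <= xx < width
                    window.append(corr[y][x][yy][xx] if inside else 0.0)
            row.append(window)
        out.append(row)
    return out


def top_k_matches(corr, k):
    """Sparse volume: keep the k highest (value, (y2, x2)) entries per pixel."""
    sparse = []
    for row in corr:
        sparse_row = []
        for per_pixel in row:
            scored = [(value, (y2, x2))
                      for y2, values in enumerate(per_pixel)
                      for x2, value in enumerate(values)]
            scored.sort(key=lambda item: item[0], reverse=True)
            sparse_row.append(scored[:k])
        sparse.append(sparse_row)
    return sparse
```

Because the dense volume has (W × H)² entries, its memory grows with the square of the pixel count, which is what motivates the sparse top-k storage.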


Given the rapid evolution in optical flow techniques, which is the best one to use? The 
answer is highly problem-dependent. One way to assess this is to look across a number of 
datasets, as is done in the Robust Vision Challenge.¹⁹ On this aggregated benchmark, variants 
of RAFT, IRR, and PWC all perform well. Another is to specifically evaluate a flow algorithm 
based on its intended use, and, if possible, to fine-tune the network on problem-specific data. 
Xue, Chen et al. (2019) describe how they fine-tune a SPyNet coarse-to-fine network on their 
synthetically degraded Vimeo-90K dataset to estimate task-oriented flow (TOFlow), which 
outperforms “higher accuracy” networks (and even ground truth flow) on three different video 
processing tasks, namely frame interpolation (Section 9.4.1), video denoising (Section 9.3.4), 
and video super-resolution. It is also possible to significantly improve the performance of 
learning-based flow algorithms by tailoring the synthetic training data to a target dataset (Sun, 
Vlasic et al. 2021). 


¹⁹http://www.robustvision.net/leaderboard.php?benchmark=flow 


9.3.2 Application: Rolling shutter wobble removal 


To save on silicon circuitry and enable greater photo sensitivity or fill factors, many CMOS 
imaging sensors such as those found in mobile phones use a rolling shutter, where different 
rows or columns are exposed in sequence. When photographing or filming a scene with fast 
scene or camera motions, this can result in straight lines becoming slanted or curved (e.g., the 
propeller blades on a plane or helicopter) or rigid parts of the scene wobbling (also known as 
the jello effect), e.g., when the camera is rapidly vibrating during action photography. 

To compensate for these distortions, which are caused by different exposure times for dif- 
ferent scanlines, accurate per-pixel optical flow must be estimated, as opposed to the whole- 
frame parametric motion that can sometimes be used for slower-motion video stabilization 
(Section 9.2.1). Baker, Bennett et al. (2010) and Forssén and Ringaby (2010) were among 
the first computer vision researchers to study this problem. In their paper, Baker, Bennett et 
al. (2010) recover a high-frequency motion field from the lower-frequency inter-frame mo- 
tions and use this to resample each output scanline. Forssén and Ringaby (2010) perform 
similar computations using models of camera rotation, which require intrinsic lens calibra- 
tion. Grundmann, Kwatra et al. (2012) remove the need for such calibration using mixtures 
of homographies to model the camera and scene motions, while Liu, Gleicher et al. (2011) 
use subspace constraints. Accurate rolling shutter correction is also required to produce high- 
quality image stitching results (Zhuang and Tran 2020). 
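To make the scanline resampling idea concrete, here is a deliberately simplified sketch that assumes a purely horizontal image motion, so that each row only needs to be shifted by the displacement it accumulated before being read out. The function names, the constant-velocity model in `row_shifts_from_velocity`, and the border clamping are assumptions of this example; the methods cited above estimate far richer motion models.

```python
import math


def row_shifts_from_velocity(velocity, line_delay, height):
    """Displacement (pixels) of each row, assuming constant horizontal velocity.

    velocity: image motion in pixels per second (hypothetical input)
    line_delay: time between the exposure of successive rows, in seconds
    """
    return [velocity * line_delay * y for y in range(height)]


def correct_rolling_shutter(image, row_shift):
    """Resample every scanline to undo its horizontal displacement.

    image: H x W nested list of intensities
    row_shift[y]: how far row y was displaced relative to the first row
    """
    corrected = []
    for y, row in enumerate(image):
        width = len(row)
        new_row = []
        for x in range(width):
            sx = x + row_shift[y]          # where this pixel was observed
            x0 = math.floor(sx)
            a = sx - x0                    # linear interpolation weight
            left = min(max(x0, 0), width - 1)
            right = min(max(x0 + 1, 0), width - 1)
            new_row.append((1.0 - a) * row[left] + a * row[right])
        corrected.append(new_row)
    return corrected
```

In this toy model, a vertical line that appears slanted in the observed image becomes vertical again after correction.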

While in some modern imaging systems such as action cameras, inertial measurement 
units (IMUs) can provide high-frequency estimates of camera motion, they cannot directly 
provide estimates of depth-dependent parallax and independent object motions. For this rea- 
son, the best in-camera image stabilizers use a combination of IMU data and sophisticated 
image processing.²⁰ Modeling rolling shutter is also important to obtain accurate pose esti- 
mates in structure from motion (Hedborg, Forssén et al. 2012; Kukelova, Albl et al. 2018; 
Albl, Kukelova et al. 2020; Kukelova, Albl et al. 2020) and visual-inertial fusion in SLAM 
(Patron-Perez, Lovegrove, and Sibley 2015; Schubert, Demmel et al. 2018), which are dis- 
cussed in Sections 11.4.2 and 11.5. 


9.3.3 Multi-frame motion estimation 


So far, we have looked at motion estimation as a two-frame problem, where the goal is to 
compute a motion field that aligns pixels from one image with those in another. In practice, 
motion estimation is usually applied to video, where a whole sequence of frames is available 
to perform this task. 


²⁰https://gopro.com/en/us/news/hero7-black-hypersmooth-technology 

Figure 9.11 Slice through a spatio-temporal volume (Szeliski 1999a) © 1999 IEEE: (a– 
b) two frames from the flower garden sequence; (c) a horizontal slice through the complete 
spatio-temporal volume, with the arrows indicating locations of potential key frames where 
flow is estimated. Note that the colors for the flower garden sequence are incorrect; the 
correct colors (yellow flowers) are shown in Figure 9.13. 


One classic approach to multi-frame motion is to filter the spatio-temporal volume using 
oriented or steerable filters (Heeger 1988), in a manner analogous to oriented edge detec- 
tion (Section 3.2.3). Figure 9.11 shows two frames from the commonly used flower garden 
sequence, as well as a horizontal slice through the spatio-temporal volume, i.e., the 3D vol- 
ume created by stacking all of the video frames together. Because the pixel motion is mostly 
horizontal, the slopes of individual (textured) pixel tracks, which correspond to their horizon- 
tal velocities, can clearly be seen. Spatio-temporal filtering uses a 3D volume around each 
pixel to determine the best orientation in space-time, which corresponds directly to a pixel’s 
velocity. 
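The relationship between the slope of a track in the spatio-temporal slice and the pixel's velocity can be illustrated with a small sketch. Note that this example does not implement oriented or steerable filters; as a simpler stand-in, it fits a single horizontal velocity to the slice by least squares on the brightness constancy constraint I_x u + I_t = 0. The function names and the `volume[t][y][x]` layout are assumptions of this example.

```python
def xt_slice(volume, y):
    """Extract the horizontal x-t slice at row y from volume[t][y][x]."""
    return [frame[y] for frame in volume]


def slice_velocity(xt):
    """Least-squares horizontal velocity (pixels/frame) for a whole x-t slice.

    Minimizes the sum of (I_x * u + I_t)^2 over the slice, which gives
    u = -sum(I_x * I_t) / sum(I_x^2).
    """
    numerator = denominator = 0.0
    for t in range(len(xt) - 1):
        for x in range(1, len(xt[0]) - 1):
            ix = 0.5 * (xt[t][x + 1] - xt[t][x - 1])  # spatial derivative
            it = xt[t + 1][x] - xt[t][x]              # temporal derivative
            numerator += ix * it
            denominator += ix * ix
    return -numerator / denominator if denominator > 0 else 0.0
```

A real spatio-temporal filter would make this estimate locally, around each pixel, which is exactly where the trade-off between filter extent and accuracy near motion discontinuities arises.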

Unfortunately, to obtain reasonably accurate velocity estimates everywhere in an image, 
spatio-temporal filters have moderately large extents, which severely degrades the quality of 
their estimates near motion discontinuities. (This same problem is endemic in 2D window- 
based motion estimators.) An alternative to full spatio-temporal filtering is to estimate more 
local spatio-temporal derivatives and use them inside a global optimization framework to fill 
in textureless regions (Bruhn, Weickert, and Schnörr 2005; Govindu 2006). 

Another alternative is to simultaneously compute multiple motion estimates, while also 
optionally reasoning about occlusion relationships (Szeliski 1999a). Figure 9.11c shows 
schematically one potential approach to this problem. The horizontal arrows show the lo- 
cations of keyframes s where motion is estimated, while other slices indicate video frames t 
whose colors are matched with those predicted by interpolating between the keyframes. Mo- 
tion estimation can be cast as a global energy minimization problem that simultaneously min- 
imizes brightness compatibility and flow compatibility terms between keyframes and other 
frames, in addition to using robust smoothness terms. 


The multi-view framework is potentially even more appropriate for rigid scene motion 
(multi-view stereo) (Section 12.7), where the unknowns at each pixel are disparities and 
occlusion relationships can be determined directly from pixel depths (Szeliski 1999a; Kol- 
mogorov and Zabih 2002). However, it is also applicable to general motion, with the addition 
of models for occlusion relationships, as in the MirrorFlow system of Hur and Roth (2017) as 
well as multi-frame versions (Janai, Güney et al. 2018; Neoral, Sochman, and Matas 2018; 
Ren, Gallo et al. 2019). 


9.3.4 Application: Video denoising 


Video denoising is the process of removing noise and other artifacts such as scratches from 
film and video (Kokaram 2004; Gai and Kang 2009; Liu and Freeman 2010). Unlike single 
image denoising, where the only information available is in the current picture, video denois- 
ers can average or borrow information from adjacent frames. However, to do this without 
introducing blur or jitter (irregular motion), they need accurate per-pixel motion estimates. 
One way to do this is to use task-oriented flow, where the flow network is specifically tuned 
end-to-end to provide the best denoising performance (Xue, Chen et al. 2019). 

Exercise 9.6 lists some of the steps required, which include the ability to determine if the 
current motion estimate is accurate enough to permit averaging with other frames. And while 
some recent papers continue to estimate flow as part of the multi-frame denoising pipeline 
(Tassano, Delon, and Veit 2019; Xue, Chen et al. 2019), others either concatenate similar 
patches from different frames (Maggioni, Boracchi et al. 2012) or concatenate small sub- 
sets of frames into a deep network that never explicitly estimates a motion representation 
(Claus and van Gemert 2019; Tassano, Delon, and Veit 2020). A more general form of video 
enhancement and restoration called video quality mapping has also recently started being 
investigated (Fuoli, Huang et al. 2020). 
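The classic flow-based approach to video denoising mentioned above, namely averaging motion-compensated samples from adjacent frames while checking whether each motion estimate is reliable enough to use, might be sketched as follows. The function name, the nearest-pixel sampling, and the simple intensity-difference test with a fixed `threshold` are assumptions of this example rather than the procedure of any cited paper.

```python
def denoise_frame(frames, flows, center, threshold=10.0):
    """Average the center frame with motion-compensated neighboring frames.

    frames: list of H x W nested lists of intensities
    flows: dict mapping a neighbor frame index to an H x W list of (u, v)
           flow vectors from the center frame to that neighbor
    threshold: maximum intensity difference for a sample to be trusted
    """
    reference = frames[center]
    height, width = len(reference), len(reference[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            total, count = reference[y][x], 1
            for index, flow in flows.items():
                u, v = flow[y][x]
                xx, yy = x + int(round(u)), y + int(round(v))
                if not (0 <= xx < width and 0 <= yy < height):
                    continue  # motion points outside the neighboring frame
                sample = frames[index][yy][xx]
                # Only average when the compensated sample agrees with the
                # reference pixel, i.e., the flow estimate looks trustworthy.
                if abs(sample - reference[y][x]) <= threshold:
                    total += sample
                    count += 1
            row.append(total / count)
        output.append(row)
    return output
```

Lowering `threshold` makes the denoiser more conservative: fewer samples are averaged, so less noise is removed, but there is also less risk of blur from bad motion estimates.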


9.4 Layered motion 


In many situations, visual motion is caused by the movement of a small number of objects 
at different depths in the scene. In such situations, the pixel motions can be described more 
succinctly (and estimated more reliably) if pixels are grouped into appropriate objects or 
layers (Wang and Adelson 1994). 

Figure 9.12 shows this approach schematically. The motion in this sequence is caused by 
the translational motion of the checkered background and the rotation of the foreground hand. 
The complete motion sequence can be reconstructed from the appearance of the foreground 
and background elements, which can be represented as alpha-matted images (sprites or video 
objects) and the parametric motion corresponding to each layer. Displacing and compositing 
these layers in back to front order (Section 3.1.3) recreates the original video sequence. 

Figure 9.12 Layered motion estimation framework (Wang and Adelson 1994) © 1994 
IEEE: The top two rows describe the two layers, each of which consists of an intensity (color) 
image, an alpha mask (black=transparent), and a parametric motion field. The layers are 
composited with different amounts of motion to recreate the video sequence. 
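The reconstruction of a frame from its layers can be sketched in a few lines. In this simplified example, each layer's parametric motion is restricted to a constant integer translation per frame, the sprites are assumed to have the same size as the output frame, and the dictionary keys `color`, `alpha`, and `velocity` are hypothetical names chosen for illustration.

```python
def composite_layers(layers, t, height, width):
    """Composite layers at time t using the over operator, back to front.

    layers: list ordered from back to front; each layer is a dict with
        'color': H x W intensities, 'alpha': H x W opacities in [0, 1],
        'velocity': (vx, vy) translation in pixels per frame.
    """
    frame = [[0.0] * width for _ in range(height)]
    for layer in layers:
        dx = int(round(layer["velocity"][0] * t))
        dy = int(round(layer["velocity"][1] * t))
        for y in range(height):
            for x in range(width):
                sy, sx = y - dy, x - dx  # sprite pixel that lands on (x, y)
                if 0 <= sy < height and 0 <= sx < width:
                    alpha = layer["alpha"][sy][sx]
                    color = layer["color"][sy][sx]
                    # over operator: new = alpha * front + (1 - alpha) * back
                    frame[y][x] = alpha * color + (1.0 - alpha) * frame[y][x]
    return frame
```

Replacing the translation with an affine or projective warp of the sprite coordinates would give the more general parametric motions used in the layered representations discussed here.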

Layered motion representations not only lead to compact representations (Wang and 
Adelson 1994; Lee, Chen et al. 1997), but they also exploit the information available in 
multiple video frames, as well as accurately modeling the appearance of pixels near motion 
discontinuities. This makes them particularly suited as a representation for image-based ren- 
dering (Section 14.2.1) (Shade, Gortler et al. 1998; Zitnick, Kang et al. 2004) as well as 
object-level video editing. 

To compute a layered representation of a video sequence, Wang and Adelson (1994) first 
estimate affine motion models over a collection of non-overlapping patches and then cluster 
these estimates using k-means. They then alternate between assigning pixels to layers and 
recomputing motion estimates for each layer using the assigned pixels, using a technique 
first proposed by Darrell and Pentland (1991). Once the parametric motions and pixel-wise 
layer assignments have been computed for each frame independently, layers are constructed 
by warping and merging the various layer pieces from all of the frames together. Median 
filtering is used to produce sharp composite layers that are robust to small intensity variations, 
as well as to infer occlusion relationships between the layers. Figure 9.13 shows the results 
of this process on the flower garden sequence. You can see both the initial and final layer 
assignments for one of the frames, as well as the composite flow and the alpha-matted layers 
with their corresponding flow vectors overlaid. 

Figure 9.13 Layered motion estimation results (Wang and Adelson 1994) © 1994 IEEE. 
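The overall structure of this procedure (per-patch motion estimates, clustering, and then alternating between pixel assignment and motion re-estimation) can be sketched as follows. To keep the example short, it makes two significant simplifications that are assumptions of this sketch and not part of the method described above: it starts from a given dense flow field, and it models each layer with a single translation instead of an affine motion. The function names and the deterministic k-means initialization are also illustrative choices.

```python
def mean_motion(vectors):
    """Average a non-empty list of (u, v) motion vectors."""
    n = len(vectors)
    return (sum(u for u, _ in vectors) / n, sum(v for _, v in vectors) / n)


def nearest_model(models, vector):
    """Index of the motion model closest to the given (u, v) vector."""
    return min(range(len(models)),
               key=lambda k: (vector[0] - models[k][0]) ** 2
                             + (vector[1] - models[k][1]) ** 2)


def estimate_layers(flow, num_layers, patch=2, iterations=5):
    """Return (labels, models) for an H x W flow field of (u, v) vectors."""
    height, width = len(flow), len(flow[0])
    # 1. One motion estimate per non-overlapping patch.
    patch_motions = [
        mean_motion([flow[y][x]
                     for y in range(py, min(py + patch, height))
                     for x in range(px, min(px + patch, width))])
        for py in range(0, height, patch)
        for px in range(0, width, patch)]
    # 2. k-means on the patch motions (evenly spaced sorted estimates as seeds).
    ordered = sorted(patch_motions)
    step = max(1, len(ordered) // num_layers)
    models = ordered[::step][:num_layers]
    for _ in range(iterations):
        groups = [[] for _ in models]
        for motion in patch_motions:
            groups[nearest_model(models, motion)].append(motion)
        models = [mean_motion(g) if g else models[k] for k, g in enumerate(groups)]
    # 3. Alternate between assigning pixels to layers and re-estimating motions.
    labels = []
    for _ in range(iterations):
        labels = [[nearest_model(models, flow[y][x]) for x in range(width)]
                  for y in range(height)]
        groups = [[] for _ in models]
        for y in range(height):
            for x in range(width):
                groups[labels[y][x]].append(flow[y][x])
        models = [mean_motion(g) if g else models[k] for k, g in enumerate(groups)]
    return labels, models
```

The subsequent steps of the method, such as warping and median-filtering the layer pieces across frames, are not covered by this sketch.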

In follow-on work, Weiss and Adelson (1996) use a formal probabilistic mixture model 
to infer both the optimal number of layers and the per-pixel layer assignments. Weiss (1997) 
further generalizes this approach by replacing the per-layer affine motion models with smooth 
regularized per-pixel motion estimates, which allows the system to better handle curved and 
undulating layers, such as those seen in most real-world sequences. 

The above approaches, however, still make a distinction between estimating the motions 
and layer assignments and then later estimating the layer colors. In the system described 
by Baker, Szeliski, and Anandan (1998), the generative model is generalized to account for 
real-world rigid motion scenes. The motion of each frame is described using a 3D camera 
model and the motion of each layer is described using a 3D plane equation plus per-pixel 
residual depth offsets (the plane plus parallax representation (Section 2.1.4)). The initial 
layer estimation proceeds in a manner similar to that of Wang and Adelson (1994), except 
that rigid planar motions (homographies) are used instead of affine motion models. The final 
model refinement, however, jointly re-optimizes the layer pixel color and opacity values $L_l$ 
and the 3D depth, plane, and motion parameters $z_l$, $n_l$, and $P_t$ by minimizing the discrepancy 
between the re-synthesized and observed motion sequences (Baker, Szeliski, and Anandan 
1998). 

Figure 9.14 shows the final results obtained with this algorithm. As you can see, the 
motion boundaries and layer assignments are much crisper than those in Figure 9.13. Because 
of the per-pixel depth offsets, the individual layer color values are also sharper than those 
obtained with affine or planar motion models. While the original system of Baker, Szeliski, 
and Anandan (1998) required a rough initial assignment of pixels to layers, Torr, Szeliski, 
and Anandan (2001) describe automated Bayesian techniques for initializing this system and 
determining the optimal number of layers. 

Figure 9.14 Layered stereo reconstruction (Baker, Szeliski, and Anandan 1998) © 1998 
IEEE: (a) first and (b) last input images; (c) initial segmentation into six layers; (d) and 
(e) the six layer sprites; (f) depth map for planar sprites (darker denotes closer); front layer 
(g) before and (h) after residual depth estimation. Note that the colors for the flower garden 
sequence are incorrect; the correct colors (yellow flowers) are shown in Figure 9.13. 

Layered motion estimation continues to be an active area of research. Representative 
papers from the 1990s and 2000s include (Sawhney and Ayer 1996; Jojic and Frey 2001; Xiao and 
Shah 2005; Kumar, Torr, and Zisserman 2008; Thayananthan, Iwasaki, and Cipolla 2008; 
Schoenemann and Cremers 2008), while more recent papers include (Sun, Sudderth, and 
Black 2012; Sun, Wulff et al. 2013; Sun, Liu, and Pfister 2014; Wulff and Black 2015) and 
(Sevilla-Lara, Sun et al. 2016), which jointly performs semantic segmentation and motion 
estimation. 

Layers are not the only way to introduce segmentation into motion estimation. A large 
number of algorithms have been developed that alternate between estimating optical flow 
vectors and segmenting them into coherent regions (Black and Jepson 1996; Ju, Black, and 
Jepson 1996; Chang, Tekalp, and Sezan 1997; Mémin and Pérez 2002; Cremers and Soatto 
2005). Some of these techniques rely on first segmenting the input color images and then 
estimating per-segment motions that produce a coherent motion field while also modeling oc- 
clusions (Zitnick, Kang et al. 2004; Zitnick, Jojic, and Kang 2005; Stein, Hoiem, and Hebert 
2007; Thayananthan, Iwasaki, and Cipolla 2008). In fact, the segmentation of videos into 
coherently moving parts has evolved into its own topic, namely video object segmentation, 
which we study in Section 9.4.3. 


9.4.1 Application: Frame interpolation 


Frame interpolation is a widely used application of motion estimation, often implemented in 
hardware to match an incoming video to a monitor’s actual refresh rate, where information in 
novel in-between frames needs to be interpolated from preceding and subsequent frames. The 
best results can be obtained if an accurate motion estimate can be computed at each unknown 
pixel’s location. However, in addition to computing the motion, occlusion information is 
critical to prevent colors from being contaminated by moving foreground objects that might 
obscure a particular pixel in a preceding or subsequent frame. 

In a little more detail, consider Figure 9.11c and assume that the arrows denote keyframes 
between which we wish to interpolate additional images. The orientations of the streaks 
in this figure encode the velocities of individual pixels. If the same motion estimate $\mathbf{u}_0$ is 
obtained at location $\mathbf{x}_0$ in image $I_0$ as is obtained at location $\mathbf{x}_0 + \mathbf{u}_0$ in image $I_1$, the flow 
vectors are said to be consistent. This motion estimate can be transferred to location $\mathbf{x}_0 + t\mathbf{u}_0$ 
in the image $I_t$ being generated, where $t \in (0, 1)$ is the time of interpolation. The final color 
value at pixel $\mathbf{x}_0 + t\mathbf{u}_0$ can be computed as a linear blend, 

$$I_t(\mathbf{x}_0 + t\mathbf{u}_0) = (1 - t) I_0(\mathbf{x}_0) + t I_1(\mathbf{x}_0 + \mathbf{u}_0). \quad (9.59)$$


If, however, the motion vectors are different at corresponding locations, some method must be 
used to determine which is correct and which image contains colors that are occluded. The ac- 
tual reasoning is even more subtle than this. One example of such an interpolation algorithm, 
based on earlier work in depth map interpolation by Shade, Gortler et al. (1998) and Zitnick, 
Kang et al. (2004), is the one used in the flow evaluation paper of Baker, Scharstein et al. 
(2011). An even higher-quality frame interpolation algorithm, which uses gradient-based re- 
construction, is presented by Mahajan, Huang et al. (2009). Accuracy on frame interpolation 
tasks is also sometimes used to gauge the quality of motion estimation algorithms (Szeliski 
1999b; Baker, Scharstein et al. 2011). 
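The consistency test and linear blend of Equation (9.59) can be sketched as follows. This example only handles the consistent case: pixels whose flow vectors disagree, or that receive no transferred sample, are simply left as holes (`None`), so none of the occlusion reasoning used by the cited algorithms is attempted. The function name, the nearest-pixel rounding, and the tolerance `tol` are assumptions of this example.

```python
def interpolate_frame(img0, img1, flow0, flow1, t, tol=0.5):
    """Generate the in-between frame at time t in (0, 1).

    img0, img1: H x W nested lists of intensities (the two keyframes)
    flow0, flow1: H x W lists of (u, v) motion estimates at each keyframe
    Returns an H x W list whose unfilled pixels are None.
    """
    height, width = len(img0), len(img0[0])
    output = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            u, v = flow0[y][x]
            x1, y1 = x + int(round(u)), y + int(round(v))
            if not (0 <= x1 < width and 0 <= y1 < height):
                continue
            u1, v1 = flow1[y1][x1]
            # Consistent if the same motion is found at x0 and at x0 + u0.
            if abs(u - u1) > tol or abs(v - v1) > tol:
                continue  # inconsistent: possibly occluded, leave a hole
            xt, yt = x + int(round(t * u)), y + int(round(t * v))
            if 0 <= xt < width and 0 <= yt < height:
                # Linear blend of the two corresponding colors (Equation 9.59).
                output[yt][xt] = (1.0 - t) * img0[y][x] + t * img1[y1][x1]
    return output
```

A complete interpolator would still need to fill the remaining holes, for example by deciding which keyframe is unoccluded at each such pixel.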

More recent frame interpolation techniques use deep neural networks as part of their 
architectures. Some approaches use spatio-temporal convolutions (Niklaus, Mai, and Liu 
2017), while others use DNNs to compute bi-directional optical flow (Xue, Chen et al. 2019) 
and then combine the contributions from the two original frames using either context fea- 
tures (Niklaus and Liu 2018) or soft visibility maps (Jiang, Sun et al. 2018). The system by 
Niklaus and Liu (2020) encodes the input frames as deep multi-resolution neural features, 
forward warps these using bi-directional flow, combines these features using softmax splat- 
ting, and then uses a final deep network to decode these combined features, as shown in 
Figure 9.15. A similar architecture can also be used to create temporally textured looping 
videos from a single still image (Holynski, Curless et al. 2021). Other recently developed 
frame interpolation networks include Choi, Choi et al. (2020), Lee, Kim et al. (2020), Kang, 
Jo et al. (2020), and Park, Ko et al. (2020). 

594 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 

Figure 9.15 Deep feature video interpolation network (Niklaus and Liu 2020) © 2020 
IEEE. This multi-stage network first computes bi-directional flow, encodes each frame using 
feature pyramids, and then warps and combines these features using softmax splatting. The 
combined features are then fed into a final image synthesis network (decoder). 


9.4.2 Transparent layers and reflections 


A special case of layered motion that occurs quite often is transparent motion, which is usu- 
ally caused by reflections seen in windows and picture frames (Figures 9.16 and 9.17). 
Some of the early work in this area handles transparent motion by either just estimating 
the component motions (Shizawa and Mase 1991; Bergen, Burt et al. 1992; Darrell and Si- 
moncelli 1993; Irani, Rousso, and Peleg 1994) or by assigning individual pixels to competing 
motion layers (Darrell and Pentland 1995; Black and Anandan 1996; Ju, Black, and Jepson 
1996), which is appropriate for scenes partially seen through a fine occluder (e.g., foliage). 
However, to accurately separate truly transparent layers, a better model for motion due to 
reflections is required. Because of the way that light is both reflected from and transmitted 
through a glass surface, the correct model for reflections is an additive one, where each mov- 
ing layer contributes some intensity to the final image (Szeliski, Avidan, and Anandan 2000). 

Figure 9.16 Light reflecting off the transparent glass of a picture frame: (a) first image from 
the input sequence; (b) dominant motion layer min-composite; (c) secondary motion residual 
layer max-composite; (d–e) final estimated picture and reflection layers. The original images 
are from Black and Anandan (1996), while the separated layers are from Szeliski, Avidan, 
and Anandan (2000) © 2000 IEEE. 


If the motions of the individual layers are known, the recovery of the individual layers is 
a simple constrained least squares problem, where the individual layer images are constrained 
to be positive and saturated pixels provide an inequality constraint on the summed values. 
However, this problem can suffer from extended low-frequency ambiguities, especially if ei- 
ther of the layers lacks dark (black) pixels or the motion is uni-directional. In their paper, 
Szeliski, Avidan, and Anandan (2000) show that the simultaneous estimation of the motions 
and layer values can be obtained by alternating between robustly computing the motion lay- 
ers and then making conservative (upper- or lower-bound) estimates of the layer intensities. 
The final motion and layer estimates can then be polished using gradient descent on a joint 
constrained least squares formulation similar to Baker, Szeliski, and Anandan (1998), where 
the over compositing operator is replaced with addition. 
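The conservative bounds mentioned above can be illustrated for the simplest case of two additive layers with known motions. Because every layer is non-negative, stabilizing the frames with respect to the dominant motion and taking the per-pixel minimum gives an upper bound on the dominant layer; subtracting that bound, stabilizing the residuals with respect to the secondary motion, and taking the per-pixel maximum gives a lower bound on the secondary layer. In this sketch, the function names, the integer horizontal velocities, and the circular (wraparound) shifts are assumptions made to keep the example short.

```python
def shift(image, dx):
    """Circularly shift every row dx pixels to the right (wraparound assumed)."""
    width = len(image[0])
    k = dx % width
    return [row[width - k:] + row[:width - k] for row in image]


def min_composite(frames, velocity):
    """Upper bound on the layer that moves with the given velocity."""
    stabilized = [shift(frame, -velocity * t) for t, frame in enumerate(frames)]
    height, width = len(frames[0]), len(frames[0][0])
    return [[min(s[y][x] for s in stabilized) for x in range(width)]
            for y in range(height)]


def max_residual_composite(frames, dominant, v_dominant, v_secondary):
    """Lower bound on the secondary layer from the stabilized residuals."""
    height, width = len(frames[0]), len(frames[0][0])
    residuals = []
    for t, frame in enumerate(frames):
        moved = shift(dominant, v_dominant * t)
        diff = [[frame[y][x] - moved[y][x] for x in range(width)]
                for y in range(height)]
        residuals.append(shift(diff, -v_secondary * t))
    return [[max(r[y][x] for r in residuals) for x in range(width)]
            for y in range(height)]
```

The bounds are only tight where the other layer is dark in at least one frame, which is one way to see why layers lacking dark pixels lead to the ambiguities described above.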

Figures 9.16 and 9.17 show the results of applying these techniques to two different pic- 
ture frames with reflections. Notice how, in the second sequence, the amount of reflected light 
is quite low compared to the transmitted light (the picture of the girl) and yet the algorithm is 
still able to recover both layers. 

Figure 9.17 Transparent motion separation (Szeliski, Avidan, and Anandan 2000) © 2000 
IEEE: (a) first image from input sequence; (b) dominant motion layer min-composite; (c) sec- 
ondary motion residual layer max-composite; (d–e) final estimated picture and reflection lay- 
ers. Note that the reflected layers in (c) and (e) are doubled in intensity to better show their 
structure. 


Unfortunately, the simple parametric motion models used in Szeliski, Avidan, and Anan- 
dan (2000) are only valid for planar reflectors and scenes with shallow depth. The extension 
of these techniques to curved reflectors and scenes with significant depth has also been stud- 
ied (Swaminathan, Kang et al. 2002; Criminisi, Kang et al. 2005; Jacquet, Hane et al. 2013), 
as has the extension to scenes with more complex 3D depth (Tsin, Kang, and Szeliski 2006). 
While motion sequences used to evaluate optical flow techniques have also started to include 
reflection and transparency (Baker, Scharstein et al. 2011; Butler, Wulff et al. 2012), the 
ground truth flow estimates they provide and use for evaluation only include the dominant 
motion at each pixel, e.g., ignoring mist and reflections. 


In more recent work, Sinha, Kopf et al. (2012) model 3D scenes with reflections captured 
from a moving camera using two layers with varying depth and reflectivity and then use 
these to produce image-based renderings (novel view synthesis), which we discuss in more 
detail in Section 14.2.1. Kopf, Langguth et al. (2013) extend the modeling and rendering 
component of this system to recover colored image gradients for each layer and then use 
gradient-domain rendering to reconstruct the novel views. Xue, Rubinstein et al. (2015) 
extend these models with a gradient sparsity prior to enable obstruction-free photography 
when looking through windows and fences. More recent papers on this topic include Yang, 
Li et al. (2016), Nandoriya, Elgharib et al. (2017), and Liu, Lai et al. (2020a). Dual-pixel 
imaging sensors, originally designed to provide fast focusing, can also be used 
to remove reflections by separating gradients into different depth planes (Punnappurath and 
Brown 2019). 

Figure 9.18 Sample sequences from the Densely Annotated Video Segmentation (DAVIS) 
datasets © Pont-Tuset, Perazzi et al. (2017). The DAVIS 2016 dataset (a) only contains 
foreground-background segmentations (red regions), while the DAVIS 2017 dataset (b) con- 
tains multiple annotated objects in each sequence (brightly colored regions). 


While all of these techniques are useful for separating or eliminating reflections that ap- 
pear as coherent images, more complex 3D geometries often give rise to spatially distributed 
specularities (Section 2.2.2) that are not amenable to layer-based representation. In such 
cases, lightfield representations such as surface lightfields (Section 14.3.2 and Figure 14.13) 
and neural light fields (Section 14.6 and Figure 14.24b) may be more appropriate. 


9.4.3 Video object segmentation 


As we have seen throughout this chapter, the accurate estimation of motion usually requires 
the segmentation of a video into coherently moving regions or objects as well as the cor- 
rect modeling of occlusions. Segmenting a video clip into coherent objects is the temporal 
analog to still image segmentation, which we studied in Section 7.5. In addition to provid- 
ing more accurate motion estimates, video object segmentation supports a variety of editing 
tasks, such as object removal and insertion (Section 10.4.5) as well as video understanding 
and interpretation. 

While the segmentation of foreground and background layers has been studied for a long 
time (Bergen, Anandan et al. 1992; Wang and Adelson 1994; Gorelick, Blank et al. 2007; 
Lee and Grauman 2010; Brox and Malik 2010b; Lee, Kim, and Grauman 2011; Fragkiadaki, 
Zhang, and Shi 2012; Papazoglou and Ferrari 2013; Wang, Shen, and Porikli 2015; Perazzi, 
Wang et al. 2015), the introduction of DAVIS (Densely Annotated VIdeo Segmentation) by 
Perazzi, Pont-Tuset et al. (2016) greatly accelerated research in this area. Figure 9.18a shows 
some frames from the original DAVIS 2016 dataset, where the first frame is annotated with 
a foreground pixel mask (shown in red) and the task is to estimate foreground masks for the 
remaining frames. The DAVIS 2017 dataset (Pont-Tuset, Perazzi et al. 2017) increased the 
number of video clips from 50 to 150, added more challenging elements such as motion blur 
and foreground occlusions, and most importantly, added more than one annotated object per 
sequence (Figure 9.18b). 

Algorithms for video object segmentation, such as OSVOS (Caelles, Maninis et al. 2017), 
FusionSeg (Jain, Xiong, and Grauman 2017), MaskTrack (Perazzi, Khoreva et al. 2017), and 
SegFlow (Cheng, Tsai et al. 2017), usually consist of a deep per-frame segmentation network 
as well as a motion estimation algorithm, which is used to link and refine the segmentations. 
Some approaches (Caelles, Maninis et al. 2017; Khoreva, Benenson et al. 2019) also fine- 
tune the segmentation networks based on the first frame annotations. More recent approaches 
have focused on increasing the computational efficiency of the pipelines (Chen, Pont-Tuset 
et al. 2018; Cheng, Tsai et al. 2018; Wug Oh, Lee et al. 2018; Wang, Zhang et al. 2019; 
Meinhardt and Leal-Taixé 2020). 

Since 2017, an annual challenge and workshop on the DAVIS dataset have been held 
in conjunction with CVPR. More recent additions to the challenges have been segmenta- 
tion with weaker annotations/scribbles (Caelles, Montes et al. 2018) or completely unsu- 
pervised segmentation, where the algorithms compute temporally linked segmentations of 
the video frames (Caelles, Pont-Tuset et al. 2019). There is also a newer, larger dataset 
called YouTube-VOS (Xu, Yang et al. 2018) with its own associated set of challenges and 
leaderboards. The number of papers published on the topic continues to be high. The 
best sources for recent work are the challenge leaderboards at https://davischallenge.org and 
https://youtube-vos.org, which are accompanied by short papers describing the techniques, as 
well as the large number of conference papers, which usually have “Video Object Segmenta- 
tion” in their titles. 


9.4.4 Video object tracking 


One of the most widely used applications of computer vision to video analysis is video object
tracking, whose uses include surveillance (Benfold and Reid 2011), animal and cell
tracking (Khan, Balch, and Dellaert 2005), sports player tracking (Lu, Ting et al. 2013), and
automotive safety (Janai, Güney et al. 2020, Chapter 6).

We have already discussed simpler examples of tracking in previous chapters, including
feature (patch) tracking in Section 7.1.5 and contour tracking in Section 7.3. Surveys and
experimental evaluations of such techniques include Lepetit and Fua (2005), Yilmaz, Javed,
and Shah (2006), Wu, Lim, and Yang (2013), and Janai, Güney et al. (2020, Chapter 6).

Figure 9.19 Visual object tracking (Smeulders, Chu et al. 2014) © 2014 IEEE: (a) high-level
model showing main tracker components; (b) some tracked region representations, including
a single bounding box, contour, blob, patch-based, sparse features, parts, and multiple
bounding boxes.

A great starting point for learning more about tracking is the survey and tutorial by Smeulders,
Chu et al. (2014), which also introduced one of the first large-scale tracking datasets, with over
300 video clips, ranging from a few seconds to a few minutes. Figure 9.19a shows some of
the main components usually present in an online tracking system, which include choosing 
representations for shape, motion, position, and appearance, as well as similarity measures, 
optimization, and optional model updating. Figure 9.19b shows some of the choices for repre- 
senting shapes and appearance, including a single bounding box, contours, patches, features, 
and parts. 

The paper includes a discussion of previous surveys and techniques, as well as datasets, 
evaluation measures, and the above-mentioned model choices. It then categorizes a selection 
of well-known and more recent algorithms into a taxonomy that includes simple matching 
with fixed templates, extended and constrained (sparse) appearance models, discriminative 
classifiers, and tracking by detection. The algorithms discussed and evaluated include KLT, 
as implemented by Baker and Matthews (2004), mean-shift (Comaniciu and Meer 2002) 
and fragments-based (Adam, Rivlin, and Shimshoni 2006) tracking, online PCA appearance 
models (Ross, Lim et al. 2008), sparse bases (Mei and Ling 2009), and Struct (Hare, Golodetz 
et al. 2015), which uses a kernelized structured output support vector machine.

Around the same time (2013), a series of annual challenges and workshops on single-target
short-term tracking called VOT (visual object tracking) began.²¹ In their journal paper
describing the evaluation methodology, Kristan, Matas et al. (2016) evaluate recent trackers
and find that variants of Struct as well as extensions of kernelized correlation filters (KCF),
originally developed by Henriques, Caseiro et al. (2014), performed the best. Other highly
influential papers from this era include Bertinetto, Valmadre et al. (2016a,b) and Danelljan,
Robinson et al. (2016). Since that time, deep networks have played an essential role in visual object
tracking, often using Siamese networks (Section 5.3.4; Bromley, Guyon et al. 1994; Chopra,
Hadsell, and LeCun 2005) to map regions being tracked into neural embeddings. Lists and
descriptions of more recent tracking algorithms can be found in the annual reports that
accompany the VOT challenges and workshops, the most recent of which is Kristan, Leonardis
et al. (2020).

²¹ https://www.votchallenge.net

In parallel with the single-object VOT challenges and workshops, a multiple object tracking
dataset was introduced as part of the KITTI vision benchmark (Geiger, Lenz, and Urtasun 2012)
and a separate benchmark was developed by Leal-Taixé, Milan et al. (2015) along with a
series of challenges,²² with the most recent results described in Dendorfer, Ošep et al. (2021).
A survey of multiple object tracking papers through 2016 can be found in Luo, Xing et al.
(2021). Simple and fast multiple object trackers include Bergmann, Meinhardt, and Leal-Taixé
(2019) and Zhou, Koltun, and Krähenbühl (2020). Until recently, however, tracking
datasets have focused mostly on people, vehicles, and animals. To expand the range of objects
that can be tracked, Dave, Khurana et al. (2020) created the TAO (tracking any object)
dataset, consisting of 2,907 videos, which were annotated “bottom-up” by first having users
tag anything that moves and then classifying such objects into 833 categories.

While in this section we have focused mostly on object tracking, the primary goal of
which is to locate an object in contiguous video frames, it is also possible to simultaneously
track and segment (Voigtlaender, Krause et al. 2019; Wang, Zhang et al. 2019) or to track
non-rigidly deforming objects such as T-shirts with deformable models from either video
(Kambhamettu, Goldgof et al. 2003; White, Crane, and Forsyth 2007; Pilet, Lepetit, and
Fua 2008; Furukawa and Ponce 2008; Salzmann and Fua 2010) or RGB-D streams (Božič,
Zollhöfer et al. 2020; Božič, Palafox et al. 2020, 2021). The recent TrackFormer paper
by Meinhardt, Kirillov et al. (2021) includes a nice review of recent work on multi-object
tracking and segmentation.


9.5 Additional reading 


Some of the earliest algorithms for motion estimation were developed for motion-compensated
video coding (Netravali and Robbins 1979) and such techniques continue to be used
in modern coding standards such as MPEG, H.263, and H.264 (Le Gall 1991; Richardson
2003).²³ In computer vision, this field was originally called image sequence analysis (Huang
1981). Some of the early seminal papers include the variational approaches developed by
Horn and Schunck (1981) and Nagel and Enkelmann (1986), and the patch-based translational
alignment technique developed by Lucas and Kanade (1981). Hierarchical (coarse-to-fine)
versions of such algorithms were developed by Quam (1984), Anandan (1989), and Bergen,
Anandan et al. (1992), although they have also long been used in motion estimation for video
coding.

²² https://motchallenge.net

Translational motion models were generalized to affine motion by Rehg and Witkin (1991), 
Fuh and Maragos (1991), and Bergen, Anandan et al. (1992) and to quadric reference sur- 
faces by Shashua and Toelg (1997) and Shashua and Wexler (2001)—see Baker and Matthews 
(2004) for a nice review. Such parametric motion estimation algorithms have found wide- 
spread application in video summarization (Teodosio and Bender 1993; Irani and Anandan 
1998), video stabilization (Hansen, Anandan et al. 1994; Srinivasan, Chellappa et al. 2005; 
Matsushita, Ofek et al. 2006), and video compression (Irani, Hsu, and Anandan 1995; Lee, 
Chen et al. 1997). Surveys of parametric image registration include those by Brown (1992), 
Zitová and Flusser (2003), Goshtasby (2005), and Szeliski (2006a).

Good general surveys and comparisons of optical flow algorithms include those by Aggar- 
wal and Nandhakumar (1988), Barron, Fleet, and Beauchemin (1994), Otte and Nagel (1994), 
Mitiche and Bouthemy (1996), Stiller and Konrad (1999), McCane, Novins et al. (2001), 
Szeliski (2006a), Baker, Scharstein et al. (2011), Sun, Yang et al. (2018), Janai, Güney et
al. (2020), and Hur and Roth (2020). The topic of matching primitives, i.e., pre-transforming 
images using filtering or other techniques before matching, is treated in a number of papers 
(Anandan 1989; Bergen, Anandan et al. 1992; Scharstein 1994; Zabih and Woodfill 1994; 
Cox, Roy, and Hingorani 1995; Viola and Wells III 1997; Negahdaripour 1998; Kim, Kol- 
mogorov, and Zabih 2003; Jia and Tang 2003; Papenberg, Bruhn et al. 2006; Seitz and Baker 
2009). Hirschmüller and Scharstein (2009) compare a number of these approaches and report
on their relative performance in scenes with exposure differences. 

The publication of the first large benchmark for evaluating optical flow algorithms by 
Baker, Scharstein et al. (2011) led to rapid advances in the quality of estimation algorithms. 
While most of the best performing algorithms used robust data and smoothness norms such as
L1 or TV and continuous variational optimization techniques, some algorithms used discrete
optimization or segmentation (Papenberg, Bruhn et al. 2006; Trobin, Pock et al. 2008; Xu, 
Chen, and Jia 2008; Lempitsky, Roth, and Rother 2008; Werlberger, Trobin et al. 2009; Lei 
and Yang 2009; Wedel, Cremers et al. 2009). 

The creation of the Sintel (Butler, Wulff et al. 2012) and KITTI (Geiger, Lenz, and Urtasun
2012) datasets further accelerated progress in optical flow algorithms. Significant papers
from this past decade include Weinzaepfel, Revaud et al. (2013), Sun, Roth, and Black (2014),
Revaud, Weinzaepfel et al. (2015), Ilg, Mayer et al. (2017), Xu, Ranftl, and Koltun (2017),
Sun, Yang et al. (2018, 2019), Hur and Roth (2019), and Teed and Deng (2020b). Good
reviews of flow papers from the last decade can be found in Sun, Yang et al. (2018), Janai,
Güney et al. (2020), and Hur and Roth (2020).

²³ https://www.itu.int/rec/T-REC-H.264

Good starting places to read about video object segmentation and video object track- 
ing are recent workshops associated with the main datasets and challenges on these topics 
(Pont-Tuset, Perazzi et al. 2017; Xu, Yang et al. 2018; Kristan, Leonardis et al. 2020; Dave, 
Khurana et al. 2020; Dendorfer, Ošep et al. 2021).


9.6 Exercises 


Ex 9.1: Correlation. Implement and compare the performance of the following correlation
algorithms:

• sum of squared differences (9.1)
• sum of robust differences (9.2)
• sum of absolute differences (9.3)
• bias–gain compensated squared differences (9.9)
• normalized cross-correlation (9.11)
• windowed versions of the above (9.22–9.23)
• Fourier-based implementations of the above measures (9.18–9.20)
• phase correlation (9.24)
• gradient cross-correlation (Argyriou and Vlachos 2003).


Compare a few of your algorithms on different motion sequences with different amounts of 
noise, exposure variation, occlusion, and frequency variations (e.g., high-frequency textures, 
such as sand or cloth, and low-frequency images, such as clouds or motion-blurred video). 
Some datasets with illumination variation and ground truth correspondences (horizontal mo- 
tion) can be found at https://vision.middlebury.edu/stereo/data (the 2005 and 2006 datasets). 


Some additional ideas, variants, and questions: 
1. When do you think that phase correlation will outperform regular correlation or SSD?
Can you show this experimentally or justify it analytically?

2. For the Fourier-based masked or windowed correlation and sum of squared differences,
the results should be the same as the direct implementations. Note that you will have
to expand (9.5) into a sum of pairwise correlations, just as in (9.22). (This is part of the
exercise.)

3. For the bias–gain corrected variant of squared differences (9.9), you will also have
to expand the terms to end up with a 3 × 3 (least squares) system of equations. If
implementing the Fast Fourier Transform version, you will need to figure out how all
of these entries can be evaluated in the Fourier domain.

4. (Optional) Implement some of the additional techniques studied by Hirschmüller and
Scharstein (2009) and see if your results agree with theirs.
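
As one possible starting point for the direct (non-Fourier) measures in Exercise 9.1, the sketch below implements SSD, SAD, and normalized cross-correlation on 1D signals in plain Python and searches exhaustively for the best integer offset. The function names, the 1D simplification, and the test signal are illustrative assumptions rather than part of the exercise; a full solution would operate on 2D images and add the robust, windowed, and Fourier-based variants.

```python
import math

def ssd(a, b):
    """Sum of squared differences between two equal-length patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sad(a, b):
    """Sum of absolute differences between two equal-length patches."""
    return sum(abs(x - y) for x, y in zip(a, b))

def ncc(a, b):
    """Normalized cross-correlation in [-1, 1]; unaffected by bias and positive gain."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A constant patch has no texture to correlate, so report zero.
    return sum(x * y for x, y in zip(da, db)) / denom if denom > 0 else 0.0

def best_offset(template, signal, score, maximize=False):
    """Slide template over signal and return the offset with the best score."""
    n = len(template)
    scores = [score(template, signal[u:u + n])
              for u in range(len(signal) - n + 1)]
    pick = max if maximize else min
    return scores.index(pick(scores))

signal = [0, 1, 3, 7, 2, 5, 9, 4, 1, 0]
template = signal[3:7]                      # cut from the signal at offset 3
brighter = [2 * v + 10 for v in template]   # same patch after a gain and bias change

print(best_offset(template, signal, ssd))                 # exact copy of the patch
print(best_offset(brighter, signal, ncc, maximize=True))  # exposure-changed patch
```

Comparing the offsets that SSD and NCC return for `brighter` is a quick way to see why bias and gain compensation matters when exposure varies between frames.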


Ex 9.2: Affine registration. Implement a coarse-to-fine direct method for affine and
projective image alignment.


1. Does it help to use lower-order (simpler) models at coarser levels of the pyramid 
(Bergen, Anandan et al. 1992)? 


2. (Optional) Implement patch-based acceleration (Shum and Szeliski 2000; Baker and 
Matthews 2004). 


3. See the Baker and Matthews (2004) survey for more comparisons and ideas. 


Ex 9.3: Stabilization. Write a program to stabilize an input video sequence. You could
implement the following steps, as described in Section 9.2.1:

1. Compute the translation (and, optionally, rotation) between successive frames with
robust outlier rejection.

2. Perform temporal high-pass filtering on the motion parameters to remove the
low-frequency component (smooth the motion).

3. Compensate for the high-frequency motion, zooming in slightly (a user-specified amount)
to avoid missing edge pixels.

4. (Optional) Do not zoom in, but instead borrow pixels from previous or subsequent
frames to fill in.

5. (Optional) Compensate for images that are blurry because of fast motion by “stealing”
higher frequencies from adjacent frames.
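
Steps 1 to 3 of Exercise 9.3 can be prototyped on the motion parameters alone, before any image warping. The sketch below assumes that horizontal frame-to-frame translations have already been estimated. It integrates them into a camera path, low-pass filters the path with a box filter (one simple choice among many), and treats the difference between the raw and smoothed paths as the high-frequency jitter to cancel. The function names and the sign convention are assumptions made for illustration.

```python
def accumulate(deltas):
    """Integrate frame-to-frame translations into a camera path (frame 0 at 0)."""
    path = [0.0]
    for d in deltas:
        path.append(path[-1] + d)
    return path

def moving_average(values, radius):
    """Box low-pass filter; the window shrinks near the ends of the sequence."""
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(len(values), i + radius + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def stabilizing_offsets(deltas, radius=2):
    """Per-frame shift that moves each frame from the raw path onto the smooth one.

    The high-frequency (high-pass) component is path - smooth, so the
    compensating shift is its negative, smooth - path.
    """
    path = accumulate(deltas)
    smooth = moving_average(path, radius)
    return [s - p for s, p in zip(smooth, path)]

# A steady pan needs no correction away from the sequence ends,
# while an alternating shake is pulled back toward its local mean.
pan = stabilizing_offsets([1.0] * 6, radius=2)
shake = stabilizing_offsets([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], radius=1)
```

The largest absolute offset gives a rough lower bound on how far to zoom in (step 3) so that no missing edge pixels become visible.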


Ex 9.4: Optical flow. Compute optical flow (spline-based or per-pixel) between two
images, using one or more of the techniques described in this chapter.

1. Test your algorithms on the motion sequences available at https://vision.middlebury.edu/flow
or http://sintel.is.tue.mpg.de and compare your results (visually) to those available
on these websites. If you think your algorithm is competitive with the best,
consider submitting it for formal evaluation.

2. Visualize the quality of your results by generating in-between images using frame
interpolation (Exercise 9.5).

3. What can you say about the relative efficiency (speed) of your approach?


Ex 9.5: Automated morphing and frame interpolation. Write a program to automatically 
morph between pairs of images. Implement the following steps, as sketched out in Sec- 
tion 9.4.1 and by Baker, Scharstein et al. (2011): 


1. Compute the flow both ways (previous exercise). Consider using a multi-frame (n > 2) 
technique to better deal with occluded regions. 


2. For each intermediate (morphed) image, compute a set of flow vectors and determine
which images should be used in the final composition.
3. Blend (cross-dissolve) the images and view with a sequence viewer. 


Try this out on images of your friends and colleagues and see what kinds of morphs you get. 
Alternatively, take a video sequence and do a high-quality slow-motion effect. Compare your 
algorithm with simple cross-fading. 


Ex 9.6: Video denoising. Implement the algorithm sketched in Application 9.3.4. Your 
algorithm should contain the following steps: 


1. Compute accurate per-pixel flow. 
2. Determine which pixels in the reference image have good matches with other frames. 


3. Either average all of the matched pixels or choose the sharpest image, if trying to
compensate for blur. Don’t forget to use regular single-frame denoising techniques as
part of your solution (see Section 3.4.2 and Exercise 3.12).

4. Devise a fall-back strategy for areas where you don’t think the flow estimates are
accurate enough.


Ex 9.7: Layered motion estimation. Decompose into separate layers (Section 9.4) a video 
sequence of a scene taken with a moving camera: 


1. Find the set of dominant (affine or planar perspective) motions, either by computing
them in blocks or by finding a robust estimate and then iteratively re-fitting outliers.
2. Determine which pixels go with each motion. 
3. Construct the layers by blending pixels from different frames. 
4. (Optional) Add per-pixel residual flows or depths. 
5. (Optional) Refine your estimates using an iterative global optimization technique. 


6. (Optional) Write an interactive renderer to generate in-between frames or view the 
scene from different viewpoints (Shade, Gortler et al. 1998). 


7. (Optional) Construct an unwrap mosaic from a more complex scene and use this to do 
some video editing (Rav-Acha, Kohli et al. 2008). 


Ex 9.8: Transparent motion and reflection estimation. Take a video sequence looking through
a window (or picture frame) and see if you can remove the reflection to better see what is
inside.

The steps are described in Section 9.4.2 and by Szeliski, Avidan, and Anandan (2000). 
Alternative approaches can be found in work by Shizawa and Mase (1991), Bergen, Burt et 
al. (1992), Darrell and Simoncelli (1993), Darrell and Pentland (1995), Irani, Rousso, and 
Peleg (1994), Black and Anandan (1996), and Ju, Black, and Jepson (1996). 


Ex 9.9: Motion segmentation. Write a program to segment an image into separately mov- 
ing regions or to reliably find motion boundaries. 

Use the DAVIS motion segmentation database (Pont-Tuset, Perazzi et al. 2017) as some 
of your test data. 


Ex 9.10: Video object tracking. Write an object tracker and test it out on one of the latest 
video object tracking datasets (Leal-Taixé, Milan et al. 2015; Kristan, Matas et al. 2016; 
Dave, Khurana et al. 2020; Kristan, Leonardis et al. 2020; Dendorfer, Ošep et al. 2021).
10 Computational photography


Figure 10.1 Computational photography: (a) merging multiple exposures to create high
dynamic range images (Debevec and Malik 1997) © 1997 ACM; (b) merging flash and
non-flash photographs (Petschnigg, Agrawala et al. 2004) © 2004 ACM; (c) image matting and
compositing (Chuang, Curless et al. 2001) © 2001 IEEE; (d) hole filling with inpainting
(Criminisi, Pérez, and Toyama 2004) © 2004 IEEE.


Of all the advances in computer vision in the last decade, computational photography has 
arguably had the most widespread commercial impact. In 2010, the seminal Frankencamera 
paper by Adams, Talvala et al. (2010) had just been released, as had one of the first widely 
used in-camera panoramic image stitching apps.¹ Fast forward to 2020, and every smartphone
now has built-in panoramic stitching, high dynamic range (HDR) exposure merging, and 
multi-image denoising and super-resolution (Hasinoff, Sharlet et al. 2016; Wronski, Garcia- 
Dorado et al. 2019; Liba, Murthy et al. 2019), and the newest phones are also simulating 
shallow depth of field (bokeh) with multiple lenses or dual pixels (Barron, Adams et al. 2015; 
Wadhwa, Garg et al. 2018; Garg, Wadhwa et al. 2019; Zhang, Wadhwa et al. 2020). 

In Section 8.2, we described how to stitch multiple images into wide field of view panora- 
mas, allowing us to create photographs that could not be captured with a regular camera. 
This is just one instance of computational photography, where image analysis and process- 
ing algorithms are applied to one or more photographs to create images that go beyond the 
capabilities of traditional imaging systems. 

In this chapter, we cover a number of additional computational photography algorithms. 
We begin with a review of photometric image calibration (Section 10.1), i.e., the measurement 
of camera and lens responses, which is a prerequisite for many of the algorithms we describe 
later. We then discuss high dynamic range imaging (Section 10.2), which captures the full 
range of brightness in a scene through the use of multiple exposures (Figure 10.1a). We also 
discuss tone mapping operators, which map wide-gamut images back into regular display 
devices such as screens and printers, as well as algorithms that merge flash and regular images 
to obtain better exposures (Figure 10.1b). 

Next, we discuss how the resolution and visual quality of images can be improved ei- 
ther by merging multiple photographs together or using sophisticated image priors or deep 
networks (Section 10.3). This includes algorithms for extracting full-color images from the 
patterned Bayer mosaics present in most cameras. 

In Section 10.4, we discuss algorithms for cutting pieces of images from one photograph 
and pasting them into others (Figure 10.1c). In Section 10.5, we describe how to generate 
novel textures from real-world samples for applications such as filling holes in images (Fig- 
ure 10.1d). We close with a brief overview of non-photorealistic rendering (Section 10.5.2), 
which can turn regular photographs into artistic renderings that resemble traditional drawings 
and paintings, and a discussion of neural network approaches to style transfer and semantic 
image synthesis (Section 10.5.3).

One topic that we do not cover extensively in this book is novel computational sensors,
optics, and cameras. A nice survey can be found in an article by Nayar (2006), the book by
Raskar and Tumblin (2010), and research papers such as Levin, Fergus et al. (2007). Some
related discussion can also be found in Sections 10.2 and 14.3.

¹ https://en.wikipedia.org/wiki/Photosynth#Mobile_apps

A good general-audience introduction to computational photography can be found in the
article by Hayes (2008) as well as survey papers by Nayar (2006), Cohen and Szeliski (2006),
Levoy (2006), and Debevec (2006).² Raskar and Tumblin (2010) give extensive coverage of
topics in this area, with particular emphasis on computational cameras and sensors. The
sub-field of high dynamic range imaging has its own book discussing research in this area
(Reinhard, Heidrich et al. 2010), as well as a wonderful book aimed more at professional
photographers (Freeman 2008).³ A good survey of image matting is provided by Wang and
Cohen (2009).

There are also several courses on computational photography where the instructors have
provided extensive online materials, e.g., Yannis Gkioulekas’ class at Carnegie Mellon,⁴
Alyosha Efros’ class at Berkeley,⁵ Frédo Durand’s Computational Photography course at MIT,⁶
Marc Levoy’s class at Stanford,⁷ and a series of SIGGRAPH courses on Computational
Photography.⁸


10.1 Photometric calibration 


Before we can successfully merge multiple photographs, we need to characterize the func- 
tions that map incoming irradiance into pixel values and also the amount of noise present 
in each image. In this section, we examine three components of the imaging pipeline (Fig- 
ure 10.2) that affect this mapping. For a more comprehensive, tunable model of modern 
digital camera processing pipelines, see the recent paper by Tseng, Yu et al. (2019). 

The first is the radiometric response function (Mitsunaga and Nayar 1999), which maps
photons arriving at the lens into digital values stored in the image file (Section 10.1.1). The
second is vignetting, which darkens pixel values near the periphery of images, especially at
large apertures (Section 10.1.3). The third is the point spread function, which characterizes
the blur induced by the lens, anti-aliasing filters, and finite sensor areas (Section 10.1.4).⁹ The
material in this section builds on the image formation processes described in Sections 2.2.3
and 2.3.3, so if it has been a while since you looked at those sections, please go back and
review them.

² See also the two special issue journals edited by Bimber (2006) and Durand and Szeliski (2007).

³ Gulbins and Gulbins (2009) discuss related photographic techniques.

⁴ CMU 15-463, http://graphics.cs.cmu.edu/courses/15-463

⁵ Berkeley CS194-26/294-26, https://inst.eecs.berkeley.edu/~cs194-26/fa20

⁶ MIT 6.815/6.865, https://stellar.mit.edu/S/course/6/sp15/6.815

⁷ Stanford CS 448A, https://graphics.stanford.edu/courses/cs448a-10

⁸ https://web.media.mit.edu/~raskar/photo

⁹ Additional photometric camera and lens effects include sensor glare, blooming, and chromatic
aberration, which can also be thought of as a spectrally varying form of geometric aberration
(Section 2.2.3).


10.1.1 Radiometric response function 


As we can see in Figure 10.2, a number of factors affect how the intensity of light arriving 
at the lens ends up being mapped into stored digital values. Let us ignore for now any non- 
uniform attenuation that may occur inside the lens, which we cover in Section 10.1.3. 

The first factors to affect this mapping are the aperture and shutter speed (Section 2.3),
which can be modeled as global multipliers on the incoming light, most conveniently
measured in exposure values (log2 brightness ratios). Next, the analog to digital (A/D) converter
on the sensing chip applies an electronic gain, usually controlled by the ISO setting on your
camera. While in theory this gain is linear, as with any electronics, non-linearities may be
present (either unintentionally or by design). Ignoring, for now, photon noise, on-chip noise,
amplifier noise, and quantization noise, which we discuss shortly, you can often assume that
the mapping between incoming light and the values stored in a RAW camera file (if your
camera supports this) is roughly linear.
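
To make the "global multiplier" view concrete, the sketch below computes the exposure difference between two camera settings in stops, i.e., as a log2 brightness ratio. It assumes the usual idealization that the light gathered is proportional to the shutter time divided by the square of the f-number, and that ISO acts as a perfectly linear gain. The function names and the ISO 100 reference are hypothetical choices for illustration.

```python
import math

def relative_exposure(f_number, shutter_s, iso=100.0):
    """Idealized global multiplier: shutter time and gain over f-number squared."""
    return shutter_s * (iso / 100.0) / (f_number ** 2)

def stops_between(setting_a, setting_b):
    """Exposure difference from a to b in stops (log2 brightness ratio).

    Each setting is a tuple (f_number, shutter_s) or (f_number, shutter_s, iso).
    """
    ratio = relative_exposure(*setting_b) / relative_exposure(*setting_a)
    return math.log2(ratio)

# Doubling the shutter time adds one stop; closing down from f/2 to f/4 removes two.
longer = stops_between((4.0, 1 / 60), (4.0, 1 / 30))
smaller_aperture = stops_between((2.0, 1 / 60), (4.0, 1 / 60))
```

Working in stops turns the chain of multipliers into a sum, which is why bracketed exposures are usually described as being a fixed number of stops apart.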

If images are being stored in the more common JPEG format, the camera’s image signal 
processor (ISP) next performs Bayer pattern demosaicing (Sections 2.3.2 and 10.3.1), which 
is a mostly linear (but often non-stationary) process. Some sharpening is also often applied 
at this stage. Next, the color values are multiplied by different constants (or sometimes a 3 × 3
color twist matrix) to perform color balancing, i.e., to move the white point closer to pure
white. Finally, a standard gamma is applied to the intensities in each color channel and the 
colors are converted into YCbCr format before being transformed by a DCT, quantized, and 
then compressed into the JPEG format (Section 2.3.3). Figure 10.2 shows all of these steps 
in pictorial form. 

Given the complexity of all of this processing, it is difficult to model the camera response 
function (Figure 10.3a), i.e., the mapping between incoming irradiance and digital RGB val- 
ues, from first principles. A more practical approach is to calibrate the camera by measuring 
correspondences between incoming light and final values. 

The most accurate, but most expensive, approach is to use an integrating sphere, which is
a large (typically 1 m diameter) sphere carefully painted on the inside with white matte paint.
An accurately calibrated light at the top controls the amount of radiance inside the sphere
(which is constant everywhere because of the sphere’s radiometry) and a small opening at the
side allows for a camera/lens combination to be mounted. By slowly varying the current going
into the light, an accurate correspondence can be established between incoming radiance and
measured pixel values. The vignetting and noise characteristics of the camera can also be
simultaneously determined.

Figure 10.2 Image sensing pipeline: (a) block diagram showing the various sources of
noise as well as the typical digital post-processing steps; (b) equivalent signal transforms,
including convolution, gain, and noise injection. The abbreviations are: RD = radial distortion,
AA = anti-aliasing filter, CFA = color filter array, Q1 and Q2 = quantization noise.

Figure 10.3 Radiometric response calibration: (a) typical camera response function,
showing the mapping between incoming log irradiance (exposure) and output eight-bit pixel
values, for one color channel (Debevec and Malik 1997) © 1997 ACM; (b) color checker
chart.


A more practical alternative is to use a calibration chart (Figure 10.3b) such as the Macbeth
or Munsell ColorChecker Chart.¹⁰ The biggest challenge with this approach is ensuring
uniform lighting. One approach is to use a large dark room with a high-quality light source
far away from (and perpendicular to) the chart. Another is to place the chart outdoors away
from any shadows. (The results will differ under these two conditions, because the color of
the illuminant will be different.)


The easiest approach is probably to take multiple exposures of the same scene while the 
camera is on a tripod and to recover the response function by simultaneously estimating the 
incoming irradiance at each pixel and the response curve (Mann and Picard 1995; Debevec 
and Malik 1997; Mitsunaga and Nayar 1999). This approach is discussed in more detail in 
Section 10.2 on high dynamic range imaging. 


If all else fails, i.e., you just have one or more unrelated photos, you can use an International
Color Consortium (ICC) profile for the camera (Fairchild 2013).¹¹ Even more simply,
you can just assume that the response is linear if they are RAW files and that the images have
a γ = 2.2 non-linearity (plus clipping) applied to each RGB channel if they are JPEG images.
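
This last fallback can be written down directly. The following is a minimal sketch, assuming 8-bit JPEG values and a pure power law with γ = 2.2; real cameras and the sRGB encoding deviate from a pure power law, so the recovered intensities should be treated as approximate. The function names are hypothetical.

```python
def jpeg_to_linear(value, gamma=2.2):
    """Undo the assumed gamma: 8-bit value -> relative linear intensity in [0, 1]."""
    clipped = min(max(value, 0), 255)
    return (clipped / 255.0) ** gamma

def linear_to_jpeg(intensity, gamma=2.2):
    """Apply clipping and the assumed gamma: linear intensity -> 8-bit value."""
    clipped = min(max(intensity, 0.0), 1.0)
    return round(255.0 * clipped ** (1.0 / gamma))

# A mid-range code value corresponds to much less than half the linear intensity.
mid_gray = jpeg_to_linear(128)
```

Merging or averaging exposures should be done on the linearized values, converting back with `linear_to_jpeg` only for display.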


¹⁰ https://www.xrite.com
¹¹ See also the ICC Information on Profiles, https://www.color.org/info_profiles2.xalter.
Figure 10.4 Noise level function estimates obtained from a single color photograph (Liu,
Szeliski et al. 2008) © 2008 IEEE. The colored curves are the estimated NLF fit as the
probabilistic lower envelope of the measured deviations between the noisy piecewise-smooth
images. The ground truth NLFs obtained by averaging 29 images are shown in gray.


10.1.2 Noise level estimation 


In addition to knowing the camera response function, it is also often important to know the 
amount of noise being injected under a particular camera setting (e.g., ISO/gain level). The 
simplest characterization of noise is a single standard deviation, usually measured in gray 
levels, independent of pixel value. A more accurate model can be obtained by estimating 
the noise level as a function of pixel value (Figure 10.4), which is known as the noise level 
function (Liu, Szeliski et al. 2008). 

As with the camera response function, the simplest way to estimate these quantities is in 
the lab, using either an integrating sphere or a calibration chart. The noise can be estimated 
either at each pixel independently, by taking repeated exposures and computing the temporal 
variance in the measurements (Healey and Kondepudy 1994), or over regions, by assuming 
that pixel values should all be the same within some region (e.g., inside a color checker 
square) and computing a spatial variance. 
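
Both estimators reduce to a standard deviation taken along a different axis. The sketch below is a minimal illustration, assuming each exposure is supplied as a flat list of pixel values and that a flat region has already been cropped out by hand; `temporal_noise` and `spatial_noise` are illustrative names, and the tiny arrays are made-up data.

```python
from statistics import stdev

def temporal_noise(frames):
    """Per-pixel sample standard deviation across repeated exposures."""
    # zip(*frames) groups the values that each pixel takes over time.
    return [stdev(samples) for samples in zip(*frames)]

def spatial_noise(region):
    """Sample standard deviation inside a region assumed to be truly constant."""
    return stdev(region)

# Three repeated exposures of a two-pixel image, and one flat chart square.
frames = [[10, 100], [12, 100], [14, 100]]
per_pixel_sigma = temporal_noise(frames)
patch_sigma = spatial_noise([50, 52, 48, 50])
```

The temporal estimate needs a static scene and camera but makes no assumption about scene content, whereas the spatial estimate works from a single image but is inflated by any real texture or shading inside the region.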

This approach can be generalized to photos where there are regions of constant or slowly 
varying intensity (Liu, Szeliski et al. 2008). First, segment the image into such regions and fit 
a constant or linear function inside each region. Next, measure the (spatial) standard deviation 
of the differences between the noisy input pixels and the smooth fitted function away from 
large gradients and region boundaries. Plot these as a function of output level for each color 
channel, as shown in Figure 10.4. Finally, fit a lower envelope to this distribution to ignore 
pixels or deviations that are outliers. A fully Bayesian approach to this problem that models
the statistical distribution of each quantity is presented by Liu, Szeliski et al. (2008). A
simpler approach, which should produce useful results in most cases, is to fit a low-dimensional
function (e.g., positive valued B-spline) to the lower envelope (see Exercise 10.2).

Figure 10.5 Single image vignetting correction (Zheng, Yu et al. 2008) © 2008 IEEE: (a)
original image with strong visible vignetting; (b) vignetting compensation as described by
Zheng, Zhou et al. (2006); (c-d) vignetting compensation as described by Zheng, Yu et al.
(2008).

Matsushita and Lin (2007b) present a technique for simultaneously estimating a camera’s
response and noise level functions based on skew (asymmetries) in level-dependent noise
distributions. Their paper also contains extensive references to previous work in these areas.
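
The simpler lower-envelope procedure for the noise level function can be prototyped in a few lines. The sketch below assumes that segmentation and fitting have already produced one (mean intensity, residual standard deviation) pair per smooth region. It bins the pairs by intensity and keeps a low quantile in each bin as a crude stand-in for the lower envelope; this is neither the Bayesian estimator of Liu, Szeliski et al. (2008) nor a B-spline fit. The function name, bin count, and quantile are hypothetical choices.

```python
def noise_level_function(samples, num_bins=8, max_value=255.0, quantile=0.1):
    """Estimate a piecewise-constant lower envelope of noise versus intensity.

    samples: iterable of (mean intensity, measured standard deviation) pairs,
             one per smooth region, with intensities in [0, max_value].
    Returns a dict mapping each non-empty bin center to its envelope value.
    """
    width = max_value / num_bins
    bins = [[] for _ in range(num_bins)]
    for mean, sigma in samples:
        index = min(int(mean / width), num_bins - 1)
        bins[index].append(sigma)
    envelope = {}
    for i, sigmas in enumerate(bins):
        if sigmas:
            sigmas.sort()
            # A low quantile ignores regions whose deviation is inflated by texture.
            envelope[(i + 0.5) * width] = sigmas[int(quantile * (len(sigmas) - 1))]
    return envelope

# Two dark and two bright regions are clean; one region in each range is textured.
regions = [(10, 2.0), (12, 2.1), (15, 9.0), (200, 4.0), (210, 4.2), (205, 15.0)]
nlf = noise_level_function(regions, num_bins=2)
```

The outliers (9.0 and 15.0) do not affect the result because the envelope is taken from below, which matches the intuition that texture can only add to the measured deviation.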


10.1.3 Vignetting 


A common problem with using wide-angle and wide-aperture lenses is that the image tends 
to darken in the corners (Figure 10.5a). This problem is generally known as vignetting and 
comes in several different forms, including natural, optical, and mechanical vignetting (Sec- 
tion 2.2.3) (Ray 2002). As with radiometric response function calibration, the most accurate 
way to calibrate vignetting is to use an integrating sphere or a picture of a uniformly colored 
and illuminated blank wall. 
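
As a rough illustration of the flat-field calibration just described, the sketch below fits a
smooth radial falloff to a (linearized) picture of a uniform wall and divides it out. The even
polynomial model in the normalized radius, the assumption that the optical center is the image
center, and the function names `fit_radial_vignetting` and `correct_vignetting` are all choices
made for this example, not something prescribed by the text.

```python
import numpy as np

def fit_radial_vignetting(flat, order=3):
    """Fit flat(r) ~ c0 + c1 r^2 + c2 r^4 + ... and return the relative gain map.

    `flat` is a linear-intensity image of a uniformly lit blank wall.
    The optical center is assumed to coincide with the image center.
    """
    h, w = flat.shape
    y, x = np.mgrid[0:h, 0:w]
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    # Squared radius, normalized so that the image corners have r^2 = 1.
    r2 = ((x - cx) ** 2 + (y - cy) ** 2) / (cx ** 2 + cy ** 2)
    # One column per even power of r: 1, r^2, r^4, ...
    A = np.stack([r2.ravel() ** k for k in range(order + 1)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, flat.ravel().astype(float), rcond=None)
    # Divide by the fitted center brightness so that the gain is 1 at the center.
    return (A @ coeffs).reshape(h, w) / coeffs[0]

def correct_vignetting(image, gain):
    """Undo the falloff by dividing a linear image by the fitted gain map."""
    return image / gain
```

The same gain map can then be applied to other images taken with the same lens and aperture,
provided their intensities have been linearized first.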

An alternative approach is to stitch a panoramic scene and to assume that the true radiance 
at each pixel comes from the central portion of each input image. This is easier to do if 
the radiometric response function is already known (e.g., by shooting in RAW mode) and 
if the exposure is kept constant. If the response function, image exposures, and vignetting 
function are unknown, they can still be recovered by optimizing a large least squares fitting 
problem (Litvinov and Schechner 2005; Goldman 2010). Figure 10.6 shows an example of 
simultaneously estimating the vignetting, exposure, and radiometric response function from 
a set of overlapping photographs (Goldman 2010). Note that unless vignetting is modeled 
and compensated, regular gradient-domain image blending (Section 8.4.4) will not create an 
attractive image. 

If only a single image is available, vignetting can be estimated by looking for slow
consistent intensity variations in the radial direction. The original algorithm proposed by
Zheng, Lin, and Kang (2006) first pre-segmented the image into smoothly varying regions and then
performed an analysis inside each region. Instead of pre-segmenting the image, Zheng, Yu et
al. (2008) compute the radial gradients at all the pixels and use the asymmetry in this
distribution (because gradients away from the center are, on average, slightly negative) to
estimate the vignetting. Figure 10.5 shows the results of applying each of these algorithms to
an image with a large amount of vignetting. Exercise 10.3 has you implement some of the above
techniques.

Figure 10.6 Simultaneous estimation of vignetting, exposure, and radiometric response
(Goldman 2010) © 2011 IEEE: (a) original average of the input images; (b) after compensating
for vignetting; (c) using gradient domain blending only (note the remaining mottled look);
(d) after both vignetting compensation and blending.


10.1.4 Optical blur (spatial response) estimation 


One final characteristic of imaging systems that you should calibrate is the spatial response 
function, which encodes the optical blur that gets convolved with the incoming image to 
produce the point-sampled image. The shape of the convolution kernel, which is also known
as the point spread function (PSF) and whose Fourier transform is the optical transfer
function (OTF), depends on several factors,
including lens blur and radial distortion (Section 2.2.3), anti-aliasing filters in front of the 
sensor, and the shape and extent of each active pixel area (Section 2.3) (Figure 10.2). A good 
estimate of this function is required for applications such as multi-image super-resolution and 
deblurring (Section 10.3). 

In theory, one could estimate the PSF by simply observing an infinitely small point light 
source everywhere in the image. Creating an array of samples by drilling through a dark plate 
and backlighting with a very bright light source is difficult in practice. 


A more practical approach is to observe an image composed of long straight lines or
bars, as these can be fitted to arbitrary precision. Because the location of a horizontal or
vertical edge can be aliased during acquisition, slightly slanted edges are preferred. The
profile and locations of such edges can be estimated to sub-pixel precision, which makes it
possible to estimate the PSF at sub-pixel resolutions (Reichenbach, Park, and Narayanswamy
1991; Burns and Williams 1999; Williams and Burns 2001; Goesele, Fuchs, and Seidel 2003).
The thesis by Murphy (2005) contains a nice survey of all aspects of camera calibration,
including the spatial frequency response (SFR), spatial uniformity, tone reproduction, color
reproduction, noise, dynamic range, color channel registration, and depth of field. It also
includes a description of a slant-edge calibration algorithm called sfrmat2.

Figure 10.7 Calibration pattern with edges equally distributed at all orientations that can
be used for PSF and radial distortion estimation (Joshi, Szeliski, and Kriegman 2008) © 2008
IEEE. A portion of an actual sensed image is shown in the middle and a close-up of the ideal
pattern is on the right.


The slant-edge technique can be used to recover a 1D projection of the 2D PSF, e.g.,
slightly vertical edges are used to recover the horizontal line spread function (LSF) (Williams
1999). The LSF is then often converted into the Fourier domain and its magnitude plotted as a
one-dimensional modulation transfer function (MTF), which indicates which image frequencies
are lost (blurred) and aliased during the acquisition process (Section 2.3.1). For most
computational photography applications, it is preferable to directly estimate the full 2D PSF,
as it can be hard to recover from its projections (Williams 1999).
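
The conversion from a line spread function to a modulation transfer function can be sketched in
a few lines. The function name `mtf_from_lsf` and the normalization to unit area (so that the
MTF equals 1 at zero frequency) are assumptions of this example; a real slant-edge pipeline
would first have to estimate the super-sampled edge profile, which is not shown here.

```python
import numpy as np

def mtf_from_lsf(lsf, sample_spacing=1.0):
    """Return (frequencies, MTF) for a sampled 1D line spread function.

    `sample_spacing` is the distance between LSF samples in pixels, so the
    returned frequencies are in cycles per pixel.
    """
    lsf = np.asarray(lsf, dtype=float)
    lsf = lsf / lsf.sum()                      # unit area, so MTF(0) = 1
    mtf = np.abs(np.fft.rfft(lsf))             # magnitude of the Fourier transform
    freqs = np.fft.rfftfreq(lsf.size, d=sample_spacing)
    return freqs, mtf
```

Because only the magnitude is kept, the result does not depend on where the edge sits inside the
sampling window; the phase, which encodes that shift, is discarded.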


Figure 10.7 shows a pattern containing edges at all orientations, which can be used to 
directly recover a two-dimensional PSF. First, corners in the pattern are located by extracting 
edges in the sensed image, linking them, and finding the intersections of the circular arcs. 
Next, the ideal pattern, whose analytic form is known, is warped (using a homography) to 
fit the central portion of the input image and its intensities are adjusted to fit the ones in 
the sensed image. If desired, the pattern can be rendered at a higher resolution than the input
image, which enables the estimation of the PSF to sub-pixel resolution (Figure 10.8a). Finally,
a large linear least squares system is solved to recover the unknown PSF kernel K,

$$K = \arg\min_K \| B - D(I * K) \|^2, \tag{10.1}$$

where B is the sensed (blurred) image, I is the predicted (sharp) image, and D is an optional
downsampling operator that matches the resolution of the ideal and sensed images (Joshi,
Szeliski, and Kriegman 2008). An alternative solution technique is to estimate 1D PSF
profiles first and to then combine them using a Radon transform (Cho, Paris et al. 2011).

Figure 10.8 Point spread function estimation using a calibration target (Joshi, Szeliski,
and Kriegman 2008) © 2008 IEEE. (a) Sub-pixel PSFs at successively higher resolutions
(note the interaction between the square sensing area and the circular lens blur). (b) The
radial distortion and chromatic aberration can also be estimated and removed. (c) PSF for a
misfocused (blurred) lens showing some diffraction and vignetting effects in the corners.
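
To make the structure of the least squares problem (10.1) concrete, here is a minimal sketch
for the simplest case, in which the sharp and blurred images have the same resolution (D is the
identity) and no constraints such as non-negativity or smoothness are placed on the kernel.
The function name `estimate_psf` and the brute-force construction of the system matrix are
assumptions made for clarity; they are not taken from the cited papers.

```python
import numpy as np

def estimate_psf(sharp, blurred, ksize=5):
    """Solve K = argmin ||B - I * K||^2 for a ksize x ksize kernel K.

    Each interior pixel contributes one linear equation: the blurred value
    equals the dot product of K with the (flipped) sharp neighborhood.
    """
    r = ksize // 2
    h, w = sharp.shape
    rows, rhs = [], []
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = sharp[y - r:y + r + 1, x - r:x + r + 1]
            rows.append(patch[::-1, ::-1].ravel())   # flip => true convolution
            rhs.append(blurred[y, x])
    A = np.asarray(rows, dtype=float)
    b = np.asarray(rhs, dtype=float)
    k, *_ = np.linalg.lstsq(A, b, rcond=None)
    return k.reshape(ksize, ksize)
```

The system is only well conditioned if the sharp image contains edges at many orientations,
which is one way to see why a pattern such as the one in Figure 10.7 is useful.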

If the process of estimating the PSF is done locally in overlapping patches of the image,
it can also be used to estimate the radial distortion and chromatic aberration induced by the
lens (Figure 10.8b). Because the homography mapping the ideal target to the sensed image
is estimated in the central (undistorted) part of the image, any (per-channel) shifts induced
by the optics manifest themselves as a displacement in the PSF centers.¹² Compensating
for these shifts eliminates both the achromatic radial distortion and the inter-channel shifts
that result in visible chromatic aberration. The color-dependent blurring caused by chromatic
aberration (Figure 2.21) can also be removed using the deblurring techniques discussed in
Section 10.3. Figure 10.8b shows how the radial distortion and chromatic aberration manifest
themselves as elongated and displaced PSFs, along with the result of removing these effects
in a region of the calibration target.

¹²This process confounds the distinction between geometric and photometric calibration. In
principle, any geometric distortion could be modeled by spatially varying displaced PSFs. In
practice, it is easier to fold any large shifts into the geometric correction component.

Figure 10.9 Estimating the PSF without using a calibration pattern (Joshi, Szeliski, and
Kriegman 2008) © 2008 IEEE: (a) Input image with blue cross-section (profile) location, (b)
Profile of sensed and predicted step edges, (c-d) Locations and values of the predicted colors
near the edge locations.

The local 2D PSF estimation technique can also be used to estimate vignetting. Fig- 
ure 10.8c shows how the mechanical vignetting manifests itself as clipping of the PSF in 
the corners of the image. For the overall dimming associated with vignetting to be properly
captured, the modified intensities of the ideal pattern need to be extrapolated from the center,
which is best done with a uniformly illuminated target.


When working with RAW Bayer-pattern images, the correct way to estimate the PSF is 
to only evaluate the least squares terms in (10.1) at sensed pixel values, while interpolating 
the ideal image to all values. For JPEG images, you should linearize your intensities first, 
e.g., remove the gamma and any other non-linearities in your estimated radiometric response 
function. 

What if you have an image that was taken with an uncalibrated camera? Can you still
recover the PSF and use it to correct the image? In fact, with a slight modification, the
previous algorithms still work.

Figure 10.10 Sample indoor image where the areas outside the window are overexposed
and inside the room are too dark.

Instead of assuming a known calibration image, you can detect strong elongated edges 
and fit ideal step edges in such regions (Figure 10.9b), resulting in the sharp image shown 
in Figure 10.9d. For every pixel that is surrounded by a complete set of valid estimated 
neighbors (green pixels in Figure 10.9c), apply the least squares formula (10.1) to estimate 
the kernel K. The resulting locally estimated PSFs can be used to correct for chromatic 
aberration (because the relative displacements between per-channel PSFs can be computed), 
as shown by Joshi, Szeliski, and Kriegman (2008). 

Exercise 10.4 provides some more detailed instructions for implementing and testing 
edge-based PSF estimation algorithms. An alternative approach, which does not require the 
explicit detection of edges but uses image statistics (gradient distributions) instead, is pre- 
sented by Fergus, Singh et al. (2006). 


10.2 High dynamic range imaging 


As we mentioned earlier in this chapter, registered images taken at different exposures can be 
used to calibrate the radiometric response function of a camera. More importantly, they can 
help you create well-exposed photographs under challenging conditions, such as brightly lit 
scenes where any single exposure contains saturated (overexposed) and dark (underexposed) 
regions (Figure 10.10). This problem is quite common, because the natural world contains a 
range of radiance values that is far greater than can be captured with any photographic sensor 
or film (Figure 10.11). Taking a set of bracketed exposures (exposures taken by a camera 
in automatic exposure bracketing (AEB) mode to deliberately under- and over-expose the 
image) gives you the material from which to create a properly exposed photograph, as shown 
in Figure 10.12 (Freeman 2008; Gulbins and Gulbins 2009; Hasinoff, Durand, and Freeman
2010; Reinhard, Heidrich et al. 2010).

Figure 10.11 Relative brightness of different scenes, ranging from 1 inside a dark room lit
by a monitor to 2,000,000 looking at the Sun. Photos courtesy of Paul Debevec.

Figure 10.12 A bracketed set of shots (using the camera’s automatic exposure bracketing
(AEB) mode) and the resulting high dynamic range (HDR) composite.

While it is possible to combine pixels from different exposures directly into a final com- 
posite (Burt and Kolczynski 1993; Mertens, Kautz, and Reeth 2007), this approach runs the 
risk of creating contrast reversals and halos. Instead, the more common approach is to
proceed in three stages:
1. Estimate the radiometric response function from the aligned images.
2. Estimate a radiance map by selecting or blending pixels from different exposures.
3. Tone map the resulting high dynamic range (HDR) image back into a displayable gamut.


The idea behind estimating the radiometric response function is relatively straightforward 
(Mann and Picard 1995; Debevec and Malik 1997; Mitsunaga and Nayar 1999; Reinhard, 
Heidrich et al. 2010). Suppose you take three sets of images at different exposures (shutter
speeds), say at ±2 exposure values.¹³ If we were able to determine the irradiance (exposure)
$E_i$ at each pixel (2.102), we could plot it against the measured pixel value $z_{ij}$ for each
exposure time $t_j$, as shown in Figure 10.13.

¹³Changing the shutter speed is preferable to changing the aperture, as the latter can modify
the vignetting and focus. Using ±2 “f-stops” (technically, exposure values, or EVs, as f-stops
refer to apertures) is usually the right compromise between capturing a good dynamic range and
having properly exposed pixels everywhere.

Figure 10.13 Radiometric calibration using multiple exposures (Debevec and Malik 1997).
Corresponding pixel values are plotted as functions of log exposures (irradiance). The curves
on the left are shifted to account for each pixel's unknown radiance until they all line up into
a single smooth curve.


Unfortunately, we do not know the irradiance values $E_i$, so these have to be estimated
at the same time as the radiometric response function f, which can be written (Debevec and
Malik 1997) as

$$z_{ij} = f(E_i t_j), \tag{10.2}$$

where $t_j$ is the exposure time for the jth image. The inverse response curve $f^{-1}$ is given by

$$f^{-1}(z_{ij}) = E_i t_j. \tag{10.3}$$

Taking logarithms of both sides (base 2 is convenient, as we can now measure quantities in
EVs), we obtain

$$g(z_{ij}) = \log f^{-1}(z_{ij}) = \log E_i + \log t_j, \tag{10.4}$$

where $g = \log f^{-1}$ (which maps pixel values $z_{ij}$ into log irradiance) is the curve we are
estimating (Figure 10.13 turned on its side).

Debevec and Malik (1997) assume that the exposure times $t_j$ are known. (Recall that
these can be obtained from a camera’s EXIF tags, but that they actually follow a power of 2
progression ..., 1/128, 1/64, 1/32, 1/16, 1/8, ... instead of the marked ..., 1/125, 1/60, 1/30,
1/15, 1/8, ... values; see Exercise 2.5.) The unknowns are therefore the per-pixel exposures
$E_i$ and the response values $g_k = g(k)$, where g can be discretized according to the 256
pixel values commonly observed in eight-bit images. (The response curves are calibrated
separately for each color channel.)

In order to make the response curve smooth, Debevec and Malik (1997) add a second-order
smoothness constraint

$$\lambda \sum_k g''(k)^2 = \lambda \sum_k [g(k-1) - 2g(k) + g(k+1)]^2, \tag{10.5}$$

which is similar to the one used in snakes (7.27). Because pixel values are more reliable in
the middle of their range (and the g function becomes singular near saturation values), they
also add a weighting (hat) function w(k) that decays to zero at both ends of the pixel value
range,

$$w(z) = \begin{cases} z - z_{\min} & z \le (z_{\min} + z_{\max})/2 \\ z_{\max} - z & z > (z_{\min} + z_{\max})/2. \end{cases} \tag{10.6}$$

Putting all of these terms together, they obtain a least squares problem in the unknowns
$\{g_k\}$ and $\{E_i\}$,

$$E = \sum_i \sum_j w(z_{ij}) [g(z_{ij}) - \log E_i - \log t_j]^2 + \lambda \sum_k w(k) g''(k)^2. \tag{10.7}$$

Figure 10.14 Recovered response function and radiance image for a real digital camera
(DCS460) (Debevec and Malik 1997) © 1997 ACM.


(To remove the overall shift ambiguity in the response curve and irradiance values, the middle 
of the response curve is set to 0.) Debevec and Malik (1997) show how this can be imple- 
mented in 21 lines of MATLAB code, which partially accounts for the popularity of their 
technique. 
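
The following NumPy sketch sets up and solves the linear system implied by (10.7). It is
written for this text and is not a transcription of the original MATLAB listing: the function
names `hat_weights` and `solve_response`, the default value of the smoothness weight `lam`, and
the use of square-root weights (so that each squared residual is scaled by w exactly as in
(10.7)) are assumptions of this example. The logarithm base is whatever base was used to
compute `log_t`.

```python
import numpy as np

def hat_weights(n=256):
    """Hat function (10.6) with z_min = 0 and z_max = n - 1."""
    z = np.arange(n, dtype=float)
    return np.minimum(z, (n - 1) - z)

def solve_response(Z, log_t, lam=10.0, n=256):
    """Recover g (length n) and log E_i from pixel samples.

    Z[i, j] is the integer value of pixel i in exposure j;
    log_t[j] is the log exposure time of image j.
    """
    Z = np.asarray(Z)
    P, J = Z.shape
    w = hat_weights(n)
    A = np.zeros((P * J + 1 + (n - 2), n + P))
    b = np.zeros(A.shape[0])
    row = 0
    for i in range(P):                      # data terms of (10.7)
        for j in range(J):
            s = np.sqrt(w[Z[i, j]])
            A[row, Z[i, j]] = s
            A[row, n + i] = -s
            b[row] = s * log_t[j]
            row += 1
    A[row, n // 2] = 1.0                    # fix the middle of the curve to 0
    row += 1
    for k in range(1, n - 1):               # weighted second differences (10.5)
        s = np.sqrt(lam * w[k])
        A[row, k - 1:k + 2] = s * np.array([1.0, -2.0, 1.0])
        row += 1
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:n], x[n:]
```

In practice only a modest number of well-distributed pixels needs to be sampled, since each one
adds an unknown as well as a set of equations.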

While Debevec and Malik (1997) assume that the exposure times $t_j$ are known exactly,
there is no reason why these additional variables cannot be thrown into the least squares
problem, constraining their final estimated values to lie close to their nominal values
$\hat{t}_j$ with an extra term $\eta \sum_j (t_j - \hat{t}_j)^2$.

Figure 10.14 shows the recovered radiometric response function for a digital camera along 
with select (relative) radiance values in the overall radiance map. Figure 10.15 shows the 
bracketed input images captured on color film and the corresponding radiance map. Note 
that while most research on high dynamic range imaging assumes that the radiometric (or 
camera) response function is independent of exposure, this is not actually the case. Rodríguez,
Vazquez-Corral, and Bertalmío (2019) describe how to take this into account to get improved
results.

Figure 10.15 Bracketed set of exposures captured with a film camera and the resulting
radiance image displayed in pseudocolor (Debevec and Malik 1997) © 1997 ACM.

While Debevec and Malik (1997) use a general second-order smooth curve g to parameterize
their response curve, Mann and Picard (1995) use a three-parameter function

$$f(E) = \alpha + \beta E^{\gamma}, \tag{10.8}$$

while Mitsunaga and Nayar (1999) use a low-order ($N < 10$) polynomial for the inverse
response function g. Pal, Szeliski et al. (2004) derive a Bayesian model that estimates an
independent smooth response function for each image, which can better model the more
sophisticated (and hence less predictable) automatic contrast and tone adjustment performed
in today’s digital cameras.

Once the response function has been estimated, the second step in creating high dynamic 
range photographs is to merge the input images into a composite radiance map. If the re- 
sponse function and images were known exactly, i.e., if they were noise free, you could use 
any non-saturated pixel value to estimate the corresponding radiance by mapping it through 
the inverse response curve E = g(z). 

Unfortunately, pixels are noisy, especially under low-light conditions when fewer photons 
arrive at the sensor. To compensate for this, Mann and Picard (1995) use the derivative of the 
response function as a weight in determining the final radiance estimate, because “flatter” 
regions of the curve tell us less about the incoming irradiance. Debevec and Malik (1997) 
use a hat function (10.6) which accentuates mid-tone pixels while avoiding saturated values.
Mitsunaga and Nayar (1999) show that to maximize the signal-to-noise ratio (SNR),
the weighting function must emphasize both higher pixel values and larger gradients in the
transfer function, i.e.,

$$w(z) = g(z)/g'(z), \tag{10.9}$$

where the weights w are used to form the final irradiance estimate

$$\log E_i = \frac{\sum_j w(z_{ij}) [g(z_{ij}) - \log t_j]}{\sum_j w(z_{ij})}. \tag{10.10}$$

Exercise 10.1 has you implement one of the radiometric response function calibration
techniques and then use it to create radiance maps.

Figure 10.16 Merging multiple exposures to create a high dynamic range composite
(Kang, Uyttendaele et al. 2003): (a-c) three different exposures; (d) merging the exposures
using classic algorithms (note the ghosting due to the horse’s head movement); (e) merging
the exposures with motion compensation.
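
The weighted average (10.10) maps directly onto array operations. In the sketch below, the
function name `merge_log_radiance`, the lookup-table representation of g and w, and the choice
to return NaN where every exposure has zero weight are assumptions of this example; any of the
weighting functions discussed above can be passed in as `w`.

```python
import numpy as np

def merge_log_radiance(Z, g, log_t, w):
    """Combine J registered exposures into a log radiance map using (10.10).

    Z: integer array of shape (J, H, W) with pixel values in [0, 255]
    g: lookup table of length 256 giving the log inverse response
    log_t: length-J array of log exposure times
    w: lookup table of length 256 giving the per-value weights
    """
    Z = np.asarray(Z)
    weights = w[Z]                                      # w(z_ij)
    values = g[Z] - np.asarray(log_t)[:, None, None]    # g(z_ij) - log t_j
    total = weights.sum(axis=0)
    num = (weights * values).sum(axis=0)
    # Pixels that are saturated or black in every exposure get no estimate.
    return np.where(total > 0, num / np.maximum(total, 1e-12), np.nan)
```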

Under real-world conditions, casually acquired images may not be perfectly registered 
and may contain moving objects. Ward (2003) uses a global (parametric) transform to align 
the input images, while Kang, Uyttendaele et al. (2003) present an algorithm that combines 
global registration with local motion estimation (optical flow) to accurately align the images 
before blending their radiance estimates (Figure 10.16). Because the images may have widely 
different exposures, care must be taken when estimating the motions, which must themselves 
be checked for consistency to avoid the creation of ghosts and object fragments. 

Even this approach, however, may not work when the camera is simultaneously undergoing
large panning motions and exposure changes, which is a common occurrence in casually
acquired panoramas. Under such conditions, different parts of the image may be seen at one
or more exposures. Devising a method to blend all of these different sources while avoiding
sharp transitions and dealing with scene motion is a challenging problem. One approach
is to first find a consensus mosaic and to then selectively compute radiances in under- and
over-exposed regions (Eden, Uyttendaele, and Szeliski 2006), as shown in Figure 10.17.
Additional techniques for constructing and displaying high dynamic range video are discussed
in Myszkowski, Mantiuk, and Krawczyk (2008), Tocci, Kiser et al. (2011), Sen, Kalantari
et al. (2012), Dufaux, Le Callet et al. (2016), Banterle, Artusi et al. (2017), and Kalantari
and Ramamoorthi (2017). Another approach is to use deep learning techniques to infer the
high dynamic range radiance image from a single low dynamic range image (Liu, Lai et al.
2020b).

Figure 10.17 HDR merging with large amounts of motion (Eden, Uyttendaele, and Szeliski
2006) © 2006 IEEE: (a) registered bracketed input images; (b) results after the first pass of
image selection: reference labels, image, and tone-mapped image; (c) results after the second
pass of image selection: final labels, compressed HDR image, and tone-mapped image.


Some cameras, such as the Sony a550 and Pentax K-7, have started integrating multiple 
exposure merging and tone mapping directly into the camera body. In the future, the need to 
compute high dynamic range images from multiple exposures may be eliminated by advances 
in camera sensor technology (Yang, El Gamal et al. 1999; Nayar and Mitsunaga 2000; Nayar
and Branzoi 2003; Kang, Uyttendaele et al. 2003; Narasimhan and Nayar 2005; Tumblin,
Agrawal, and Raskar 2005). However, the need to blend such images and to tone map them
to lower-gamut displays is likely to remain.


HDR image formats. Before we discuss techniques for mapping HDR images back to a 
displayable gamut, we should discuss the commonly used formats for storing HDR images. 

If storage space is not an issue, storing each of the R, G, and B values as a 32-bit IEEE 
float is the best solution. The commonly used Portable PixMap (.ppm) format, which supports 
both uncompressed ASCII and raw binary encodings of values, can be extended to a Portable 
FloatMap (.pfm) format by modifying the header. TIFF also supports full floating point 
values. 

A more compact representation is the Radiance format (.pic, .hdr) (Ward 1994), which
uses a single common exponent and per-channel mantissas. An intermediate encoding, OpenEXR
from ILM,¹⁴ uses 16-bit floats for each channel, which is a format supported natively on most
modern GPUs. Ward (2004) describes these and other data formats such as LogLuv (Larson
1998) in more detail, as do the books by Freeman (2008) and Reinhard, Heidrich et al. (2010).
An even more recent HDR image format is the JPEG XR standard.
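
To see what "a single common exponent and per-channel mantissas" means in practice, the sketch
below packs one RGB triple into four bytes and unpacks it again. It follows the widely
circulated shared-exponent scheme as we understand it; the function names are ours, negative
values are not handled, and the file header and run-length encoding of the actual format are
not shown.

```python
import math

def float_to_rgbe(r, g, b):
    """Pack non-negative floats into (R, G, B, E) bytes with a shared exponent."""
    v = max(r, g, b)
    if v < 1e-32:
        return (0, 0, 0, 0)
    mantissa, exponent = math.frexp(v)   # v = mantissa * 2**exponent, 0.5 <= mantissa < 1
    scale = mantissa * 256.0 / v         # maps the largest channel into [128, 256)
    return (int(r * scale), int(g * scale), int(b * scale), exponent + 128)

def rgbe_to_float(r, g, b, e):
    """Unpack (R, G, B, E) bytes back into floats."""
    if e == 0:
        return (0.0, 0.0, 0.0)
    f = math.ldexp(1.0, e - (128 + 8))   # 2**(e - 128) / 256
    return (r * f, g * f, b * f)
```

Because the exponent is chosen from the brightest channel, much dimmer channels keep only a few
significant bits: the triple (1000.0, 10.0, 0.1) decodes to (1000.0, 8.0, 0.0).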


10.2.1 Tone mapping 


Once a radiance map has been computed, it is usually necessary to display it on a lower gamut 
(i.e., eight-bit) screen or printer. A variety of tone mapping techniques has been developed for 
this purpose, which involve either computing spatially varying transfer functions or reducing 
image gradients to fit the available dynamic range (Reinhard, Heidrich et al. 2010). 

The simplest way to compress a high dynamic range radiance image into a low dynamic 
range gamut is to use a global transfer curve (Larson, Rushmeier, and Piatko 1997). Fig- 
ure 10.18 shows one such example, where a gamma curve is used to map an HDR image back 
into a displayable gamut. If gamma is applied separately to each channel (Figure 10.18b), the 
colors become muted (less saturated), as higher-valued color channels contribute less (pro- 
portionately) to the final color. Extracting the luminance channel from the color image using 
(2.104), applying the global mapping to the luminance channel, and then reconstituting the 
color image using (10.19) works better (Figure 10.18c). 
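
A minimal sketch of the luminance-based variant is given below. The Rec. 709 luminance weights
stand in for the luminance formula referenced above and may differ from it; the function name
`tonemap_gamma_luminance`, the normalization by the maximum luminance, and the default gamma
of 2.2 are likewise assumptions of this example. Colors are reconstituted with the ratio rule
(10.19), with the saturation exponent s left at 1 by default.

```python
import numpy as np

def tonemap_gamma_luminance(rgb, gamma=2.2, s=1.0):
    """Apply a global gamma curve to luminance only and rebuild the colors.

    rgb: float array of shape (..., 3) holding linear HDR radiance values.
    """
    rgb = np.asarray(rgb, dtype=float)
    # Luminance from linear RGB (Rec. 709 weights, an assumption of this sketch).
    L_in = rgb @ np.array([0.2126, 0.7152, 0.0722])
    L_in = np.maximum(L_in, 1e-8)
    # Global transfer curve applied to the normalized luminance.
    L_out = (L_in / L_in.max()) ** (1.0 / gamma)
    # Reconstitute color: C_out = (C_in / L_in)^s * L_out.
    out = (rgb / L_in[..., None]) ** s * L_out[..., None]
    return np.clip(out, 0.0, 1.0)
```

Because every channel of a pixel is scaled by the same factor when s = 1, the ratios between
channels, and hence the hue and saturation, are unchanged wherever no clipping occurs.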

Unfortunately, when the image has a really wide range of exposures, this global approach
still fails to preserve details in regions with widely varying exposures. What is needed,
instead, is something akin to the dodging and burning performed by photographers in the
darkroom. Mathematically, this is similar to dividing each pixel by the average brightness in a
region around that pixel.

¹⁴https://www.openexr.net.

Figure 10.18 Global tone mapping: (a) input HDR image, linearly mapped; (b) gamma
applied to each color channel independently; (c) gamma applied to intensity (colors are less
washed out). Original HDR image courtesy of Paul Debevec, https://www.pauldebevec.com/
Research/HDR. Processed images courtesy of Frédo Durand, MIT 6.815/6.865 course on
Computational Photography.

Figure 10.19 shows how this process works. As before, the image is split into its luminance
and chrominance channels. The log luminance image

$$H(x,y) = \log L(x,y) \tag{10.11}$$

is then low-pass filtered to produce a base layer

$$H_L(x,y) = B(x,y) * H(x,y), \tag{10.12}$$

and a high-pass detail layer

$$H_H(x,y) = H(x,y) - H_L(x,y). \tag{10.13}$$

The base layer is then contrast reduced by scaling to the desired log-luminance range,

$$H'_L(x,y) = s\, H_L(x,y) \tag{10.14}$$

and added to the detail layer to produce the new log-luminance image

$$I(x,y) = H_H(x,y) + H'_L(x,y), \tag{10.15}$$

which can then be exponentiated to produce the tone-mapped (compressed) luminance image.
Note that this process is equivalent to dividing each luminance value by (a monotonic
mapping of) the average log-luminance value in a region around that pixel.
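
Equations (10.11) to (10.15) can be sketched with a separable Gaussian as the low-pass filter
B. The helper `gaussian_blur`, the function name `local_tonemap`, the edge-replicating padding,
the target range of 100:1, and the final normalization by the largest base-layer value are all
assumptions of this example. Because a linear filter is used, the output overshoots next to
strong edges, so values slightly above 1 are to be expected there.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian low-pass filter with edge-replicating borders."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def local_tonemap(L, sigma=4.0, target_range=np.log(100.0)):
    """Compress the base layer of the log luminance, keep the detail layer."""
    H = np.log(np.maximum(L, 1e-8))              # (10.11)
    H_L = gaussian_blur(H, sigma)                # (10.12) base layer
    H_H = H - H_L                                # (10.13) detail layer
    rng = H_L.max() - H_L.min()
    s = min(1.0, target_range / rng) if rng > 0 else 1.0
    H_L_scaled = s * H_L                         # (10.14)
    I = H_H + H_L_scaled                         # (10.15)
    # Shift so that the brightest base-layer value maps to 1 after exponentiation.
    return np.exp(I - H_L_scaled.max())
```

Swapping `gaussian_blur` for an edge-preserving filter leaves the rest of the pipeline
unchanged.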

Figure 10.19 Local tone mapping using linear filters: (a) low-pass and high-pass filtered
log luminance images and color (chrominance) image; (b) resulting tone-mapped image (after
attenuating the low-pass log luminance image) shows visible halos around the trees.
Processed images courtesy of Frédo Durand, MIT 6.815/6.865 course on Computational
Photography.

Figure 10.20 Local tone mapping using a bilateral filter (Durand and Dorsey 2002): (a)
low-pass and high-pass bilateral filtered log luminance images and color (chrominance)
image; (b) resulting tone-mapped image (after attenuating the low-pass log luminance image)
shows no halos. Processed images courtesy of Frédo Durand, MIT 6.815/6.865 course on
Computational Photography.

Figure 10.21 Gaussian vs. bilateral filtering (Petschnigg, Agrawala et al. 2004) © 2004
ACM: A Gaussian low-pass filter blurs across all edges and therefore creates strong peaks
and valleys in the detail image that cause halos. The bilateral filter does not smooth across
strong edges and thereby reduces halos while still capturing detail.


Figure 10.19 shows the low-pass and high-pass log luminance image and the resulting 
tone-mapped color image. Note how the detail layer has visible halos around the high-
contrast edges, which are visible in the final tone-mapped image. This is because linear
filtering, which is not edge preserving, produces halos in the detail layer (Figure 10.21). 

The solution to this problem is to use an edge-preserving filter to create the base layer. Du- 
rand and Dorsey (2002) study a number of such edge-preserving filters, including anisotropic 
and robust anisotropic diffusion, and select bilateral filtering (Section 3.3.1) as their edge-
preserving filter. (The paper by Farbman, Fattal et al. (2008) argues in favor of using a
weighted least squares (WLS) filter as an alternative to the bilateral filter and Paris,
Kornprobst et al. (2008) review bilateral filtering and its applications in computer vision and
computational photography.) Figure 10.20 shows how replacing the linear low-pass filter with 
a bilateral filter produces tone-mapped images with no visible halos. Figure 10.22 summa- 
rizes the complete information flow in this process, starting with the decomposition into log 
luminance and chrominance images, bilateral filtering, contrast reduction, and re-composition 
into the final output image. 

An alternative to compressing the base layer is to compress its derivatives, i.e., the
gradient of the log-luminance image (Fattal, Lischinski, and Werman 2002). Figure 10.23
illustrates this process. The log-luminance image is differentiated to obtain a gradient image

$$H'(x,y) = \nabla H(x,y). \tag{10.16}$$

This gradient image is then attenuated by a spatially varying attenuation function $\Phi(x,y)$,

$$G(x,y) = H'(x,y)\, \Phi(x,y). \tag{10.17}$$

The attenuation function $\Phi(x,y)$ is designed to attenuate large-scale brightness changes
(Figure 10.24a) and is designed to take into account gradients at different spatial scales
(Fattal, Lischinski, and Werman 2002).

Figure 10.22 Local tone mapping using a bilateral filter (Durand and Dorsey 2002):
summary of algorithm workflow. Images courtesy of Frédo Durand, MIT 6.815/6.865 course on
Computational Photography.

After attenuation, the resulting gradient field is re-integrated by solving a first-order
variational (least squares) problem,

$$\min \iint \| \nabla I(x,y) - G(x,y) \|^2 \, dx \, dy \tag{10.18}$$

to obtain the compressed log-luminance image I(x, y). This least squares problem is the same
that was used for Poisson blending (Section 8.4.4) and was first introduced in our study of
regularization (Section 4.2, 4.24). It can efficiently be solved using techniques such as
multigrid and hierarchical basis preconditioning (Fattal, Lischinski, and Werman 2002; Szeliski
2006b; Farbman, Fattal et al. 2008; Krishnan and Szeliski 2011; Krishnan, Fattal, and Szeliski
2013). Once the new luminance image has been computed, it is combined with the original color
image using

$$C_{out} = \left( \frac{C_{in}}{L_{in}} \right)^s L_{out}, \tag{10.19}$$

where C = (R, G, B) and $L_{in}$ and $L_{out}$ are the original and compressed luminance images.
The exponent s controls the saturation of the colors and is typically in the range
$s \in [0.4, 0.6]$ (Fattal, Lischinski, and Werman 2002). Figure 10.24b shows the final
tone-mapped color image, which shows no visible halos despite the extremely large variation in
input radiance values.

Figure 10.23 Gradient domain tone mapping (Fattal, Lischinski, and Werman 2002) ©
2002 ACM. The original image with a dynamic range of 2415:1 is first converted into the log
domain, H(x), and its gradients are computed, H'(x). These are attenuated (compressed)
based on local contrast, G(x), and integrated to produce the new logarithmic exposure image
I(x), which is exponentiated to produce the final intensity image, whose dynamic range is
7.5:1.
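
For a single scanline, as in Figure 10.23, the whole pipeline fits in a few lines, because in
one dimension the least squares problem (10.18) is solved exactly by a running sum of the
attenuated gradients. The single-scale power-law attenuation used below, its parameters
`alpha_frac` and `beta`, and the function name `compress_scanline` are assumptions of this
sketch and are not meant to reproduce the multi-scale attenuation function of the cited paper.
In two dimensions the running sum has to be replaced by a Poisson solver.

```python
import numpy as np

def compress_scanline(L, alpha_frac=0.1, beta=0.85):
    """Gradient-domain compression of a 1D array of positive luminances."""
    H = np.log(np.asarray(L, dtype=float))     # log luminance
    dH = np.diff(H)                             # gradient H'(x), as in (10.16)
    mag = np.abs(dH)
    alpha = alpha_frac * mag.mean()             # gradients above alpha are attenuated
    phi = np.ones_like(dH)
    nz = mag > 0
    phi[nz] = (mag[nz] / alpha) ** (beta - 1.0) # beta < 1 shrinks large gradients
    G = dH * phi                                # attenuated gradients, as in (10.17)
    I = np.concatenate([[0.0], np.cumsum(G)])   # exact 1D minimizer of (10.18)
    return np.exp(I - I.max())                  # brightest sample maps to 1
```

Since the attenuation only rescales each gradient by a positive factor, the ordering of
brightness values along the scanline is preserved.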

Yet another alternative to these two approaches is to perform the local dodging and burn- 
ing using a locally scale-selective operator (Reinhard, Stark et al. 2002). Figure 10.25 shows 
how such a scale selection operator can determine a radius (scale) that only includes similar 
color values within the inner circle while avoiding much brighter values in the surrounding 
circle. In practice, a difference of Gaussians normalized by the inner Gaussian response is 
evaluated over a range of scales, and the largest scale whose metric is below a threshold is 
selected (Reinhard, Stark et al. 2002). 

Another recently developed approach to tone mapping based on multi-resolution decom- 
position is the Local Laplacian Filter (Paris, Hasinoff, and Kautz 2011), which we introduced 
in Section 3.5.3. Coefficients in a Laplacian pyramid are constructed from locally contrast- 
adjusted patches, which enables the technique to not only tone map HDR images, but also to 
enhance local details and do style transfer (Aubry, Paris et al. 2014). 

Figure 10.24 Gradient domain tone mapping (Fattal, Lischinski, and Werman 2002) © 2002 ACM:
(a) attenuation map, with darker values corresponding to more attenuation; (b) final tone-mapped
image.

Figure 10.26 Interactive local tone mapping (Lischinski, Farbman et al. 2006) © 2006 ACM:
(a) user-drawn strokes with associated exposure values g(x, y); (b) corresponding
piecewise-smooth exposure adjustment map f(x, y).

What all of these techniques have in common is that they adaptively attenuate or brighten
different regions of the image so that they can be displayed in a limited gamut without loss of
contrast. Lischinski, Farbman et al. (2006) introduce an interactive technique that performs
this operation by interpolating a set of sparse user-drawn adjustments (strokes and associated
exposure value corrections) to a piecewise-continuous exposure correction map (Figure 10.26).
The interpolation is performed by minimizing a locally weighted least squares (WLS) variational
problem,

$$\min \iint w_d(x, y) \, \| f(x, y) - g(x, y) \|^2 \, dx \, dy + \lambda \iint w_s(x, y) \, \| \nabla f(x, y) \|^2 \, dx \, dy, \tag{10.20}$$
where $g(x, y)$ and $f(x, y)$ are the input and output log exposure (attenuation) maps
(Figure 10.26). The data weighting term $w_d(x, y)$ is 1 at stroke locations and 0 elsewhere. The
smoothness weighting term $w_s(x, y)$ is inversely proportional to the log-luminance gradient,

$$w_s = \frac{1}{\| \nabla H \|^{\alpha} + \epsilon} \tag{10.21}$$

and hence encourages the $f(x, y)$ map to be smoother in low-gradient areas than along
high-gradient discontinuities.¹⁵ The same approach can also be used for fully automated tone
mapping by setting target exposure values at each pixel and allowing the weighted least squares
to convert these into piecewise smooth adjustment maps.
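To make the roles of the two weighting terms concrete, here is a one-dimensional discretization of (10.20) and (10.21), solved as a small dense linear system. It is a sketch under simplifying assumptions (1D signal, forward differences, dense solve); the function name `wls_interpolate_1d` is hypothetical, and the default parameters follow the values reported by Lischinski, Farbman et al. (2006).

```python
import numpy as np

def wls_interpolate_1d(g, w_d, log_lum, lam=0.2, alpha=1.0, eps=1e-4):
    """Minimize sum_i w_d[i] (f[i] - g[i])^2 + lam * sum_i w_s[i] (f[i+1] - f[i])^2.

    g:       target exposure values (only used where w_d > 0)
    w_d:     data weights, 1 at stroke locations and 0 elsewhere
    log_lum: log-luminance signal H that controls the smoothness weights
    """
    g = np.asarray(g, dtype=float)
    w_d = np.asarray(w_d, dtype=float)
    n = g.size
    # Smoothness weights from (10.21), one per neighboring pair.
    w_s = 1.0 / (np.abs(np.diff(log_lum)) ** alpha + eps)
    # Forward-difference operator of shape (n - 1, n).
    D = np.diff(np.eye(n), axis=0)
    # Normal equations: (W_d + lam * D^T W_s D) f = W_d g.
    A = np.diag(w_d) + lam * D.T @ (w_s[:, None] * D)
    return np.linalg.solve(A, w_d * g)

# Usage: two strokes on either side of a strong log-luminance edge.
n = 20
log_lum = np.where(np.arange(n) < 10, 0.0, 3.0)
g = np.zeros(n)
w_d = np.zeros(n)
g[2], w_d[2] = -1.0, 1.0    # darken the left region
g[17], w_d[17] = 1.0, 1.0   # brighten the right region
f = wls_interpolate_1d(g, w_d, log_lum)
```

In this example the solution is nearly constant on each side of the edge and changes abruptly across it, because the smoothness weight is small only where the log-luminance gradient is large. A practical 2D implementation would use a sparse solver instead of a dense matrix.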

The weighted least squares algorithm, which was originally developed for image colorization
applications (Levin, Lischinski, and Weiss 2004), has since been applied to general
edge-preserving smoothing in applications such as contrast enhancement (Bae, Paris, and Durand
2006) and tone mapping (Farbman, Fattal et al. 2008), where bilateral filtering was previously
used. It can also be used to perform HDR merging and tone mapping simultaneously (Raman and
Chaudhuri 2007, 2009).

Given the wide range of locally adaptive tone mapping algorithms that have been devel- 
oped, which ones should be used in practice? Freeman (2008) provides a great discussion 
of commercially available algorithms, their artifacts, and the parameters that can be used to 
control them. He also has a wealth of tips for HDR photography and workflow. I highly rec- 
ommend his book for anyone contemplating additional research (or personal photography) in 


this area. 


10.2.2 Application: Flash photography 


While high dynamic range imaging combines images of a scene taken at different exposures, 
it is also possible to combine flash and non-flash images to achieve better exposure and color 


balance and to reduce noise (Eisemann and Durand 2004; Petschnigg, Agrawala et al. 2004). 


¹⁵ In practice, the x and y discrete derivatives are weighted separately (Lischinski, Farbman
et al. 2006). Their default parameter settings are λ = 0.2, α = 1, and ε = 0.0001.

Figure 10.27 Detail transfer in flash/no-flash photography (Petschnigg, Agrawala et al. 2004)
© 2004 ACM: (a) details of input ambient $A$ and flash $F$ images; (b) joint bilaterally
filtered no-flash image $A^\text{NR}$; (c) detail layer $F^\text{Detail}$ computed from the flash
image $F$; (d) final merged image $A^\text{Final}$.


The problem with flash images is that the color is often unnatural (it fails to capture the 
ambient illumination), there may be strong shadows or specularities, and there is a radial 
falloff in brightness away from the camera (Figures 10.1b and 10.27a). Non-flash photos 
taken under low light conditions often suffer from excessive noise (because of the high ISO 
gains and low photon counts) and blur (due to longer exposures). Is there some way to 
combine a non-flash photo taken just before the flash goes off with the flash photo to produce 
an image with good color values, sharpness, and low noise? In fact, the discontinued FujiFilm 
FinePix F40fd camera takes a pair of flash and no flash images in quick succession; however, 
it only lets you decide to keep one of them. 

Petschnigg, Agrawala et al. (2004) approach this problem by first filtering the no-flash
(ambient) image $A$ with a variant of the bilateral filter called the joint bilateral filter¹⁶ in
which the range kernel (3.36)

$$r(i, j, k, l) = \exp\left( -\frac{\| f(i, j) - f(k, l) \|^2}{2 \sigma_r^2} \right) \tag{10.22}$$

is evaluated on the flash image $F$ instead of the ambient image $A$, as the flash image is less
noisy and hence has more reliable edges (Figure 10.27b). Because the contents of the flash
image can be unreliable inside and at the boundaries of shadows and specularities, these are
detected and a regular bilaterally filtered image $A^\text{Base}$ is used instead (Figure 10.28).
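The joint bilateral filter can be sketched as follows for grayscale images. This is an unoptimized illustration, not the implementation of Petschnigg, Agrawala et al. (2004): the function name, the window radius, and the values of `sigma_d` and `sigma_r` are assumptions, and the shadow and specularity handling is omitted.

```python
import numpy as np

def joint_bilateral_filter(ambient, flash, radius=2, sigma_d=2.0, sigma_r=0.1):
    """Smooth `ambient` using a spatial (domain) kernel on pixel offsets and a
    range kernel, as in (10.22), evaluated on the guide image `flash`."""
    h, w = ambient.shape
    a_pad = np.pad(ambient, radius, mode="edge")
    f_pad = np.pad(flash, radius, mode="edge")
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ys, xs = radius + dy, radius + dx
            a_shift = a_pad[ys:ys + h, xs:xs + w]
            f_shift = f_pad[ys:ys + h, xs:xs + w]
            domain_w = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_d ** 2))
            range_w = np.exp(-((flash - f_shift) ** 2) / (2.0 * sigma_r ** 2))
            num += domain_w * range_w * a_shift
            den += domain_w * range_w
    return num / den  # den >= 1 because the center pixel always has weight 1

# Usage: a noisy ambient step edge guided by a clean flash image.
rng = np.random.default_rng(0)
step = np.zeros((16, 16))
step[:, 8:] = 1.0
flash = step
ambient = 0.5 * step + rng.normal(0.0, 0.05, step.shape)
denoised = joint_bilateral_filter(ambient, flash)
```

Passing the same image as both `ambient` and `flash` reduces this to a regular bilateral filter, which is how a base image such as $A^\text{Base}$ could be computed in this sketch.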

The second stage of their algorithm computes a flash detail image

$$F^\text{Detail} = \frac{F + \epsilon}{F^\text{Base} + \epsilon}, \tag{10.23}$$

where $F^\text{Base}$ is a bilaterally filtered version of the flash image $F$ and
$\epsilon = 0.02$. This detail image (Figure 10.27c) encodes details that may have been filtered
away from the noise-reduced no-flash image $A^\text{NR}$, as well as additional details created
by the flash camera, which often add crispness. The detail image is used to modulate the
noise-reduced ambient image $A^\text{NR}$ to produce the final results

$$A^\text{Final} = (1 - M) \, A^\text{NR} F^\text{Detail} + M A^\text{Base} \tag{10.24}$$

shown in Figures 10.1b and 10.27d.

¹⁶ Eisemann and Durand (2004) call this the cross bilateral filter.

Figure 10.28 Flash/no-flash photography algorithm (Petschnigg, Agrawala et al. 2004) © 2004
ACM. The ambient (no-flash) image $A$ is filtered with a regular bilateral filter to produce
$A^\text{Base}$, which is used in shadow and specularity regions, and a joint bilaterally
filtered noise reduced image $A^\text{NR}$. The flash image $F$ is bilaterally filtered to
produce a base image $F^\text{Base}$ and a detail (ratio) image $F^\text{Detail}$, which is used
to modulate the denoised ambient image. The shadow/specularity mask $M$ is computed by comparing
linearized versions of the flash and no-flash images.
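Equations (10.23) and (10.24) translate directly into array arithmetic. The sketch below assumes that the filtered images and the mask have already been computed; the function names are hypothetical.

```python
import numpy as np

def flash_detail(flash, flash_base, eps=0.02):
    """Ratio (detail) image of (10.23); eps avoids division by zero in dark areas."""
    return (flash + eps) / (flash_base + eps)

def merge_flash_no_flash(a_nr, a_base, f_detail, mask):
    """Final image of (10.24): detail-modulated A_NR outside the mask and the
    plain bilaterally filtered A_Base inside shadows and specularities."""
    return (1.0 - mask) * a_nr * f_detail + mask * a_base

# Usage on a tiny 1 x 2 example.
F = np.array([[0.5, 0.8]])
F_base = np.array([[0.5, 0.4]])
detail = flash_detail(F, F_base)
a_nr = np.array([[0.2, 0.2]])
a_base = np.array([[0.3, 0.3]])
final = merge_flash_no_flash(a_nr, a_base, detail, mask=np.array([[0.0, 1.0]]))
```

Where the flash image equals its base layer the ratio is 1 and the ambient image is left unchanged, so only local detail (and not the overall flash brightness or color) is transferred.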

Eisemann and Durand (2004) present an alternative algorithm that shares some of the 
same basic concepts. Both papers are well worth reading and contrasting (Exercise 10.6). 

Flash images can also be used for a variety of additional applications such as extracting
more reliable foreground mattes of objects (Raskar, Tan et al. 2004; Sun, Li et al. 2006).
Given a large enough training set, it is also possible to decompose single flash images into
their ambient and flash illumination components, which can be used to adjust their appearance
(Aksoy, Kim et al. 2018). Flash photography is just one instance of the more general topic
of active illumination, which is discussed in more detail by Raskar and Tumblin (2010) and
Ikeuchi, Matsushita et al. (2020).

10.3 Super-resolution, denoising, and blur removal


While high dynamic range imaging enables us to obtain an image with a larger dynamic 
range than a single regular image, super-resolution enables us to create images with higher 
spatial resolution and less noise than regular camera images (Chaudhuri 2001; Park, Park, and 
Kang 2003; Capel and Zisserman 2003; Capel 2004; van Ouwerkerk 2006; Anwar, Khan, 
and Barnes 2020). Most commonly, super-resolution refers to the process of aligning and 
combining several input images to produce such high-resolution composites (Irani and Peleg 
1991; Cheeseman, Kanefsky et al. 1993; Pickup, Capel et al. 2009; Wronski, Garcia-Dorado 
et al. 2019). However, some techniques can super-resolve a single image (Freeman, Jones, 
and Pasztor 2002; Baker and Kanade 2002; Fattal 2007; Anwar, Khan, and Barnes 2020) and 
are hence closely related to techniques for removing blur (Sections 3.4.1 and 3.4.2). Anwar, 
Khan, and Barnes (2020) provide a comprehensive review of single image super-resolution 
techniques with a particular focus on recent deep learning-based approaches. 

A traditional way to formulate the super-resolution problem is to write down the stochastic
image formation equations and image priors and to then use Bayesian inference to recover the
super-resolved (original) sharp image. We can do this by generalizing the image formation
equations used for image deblurring (Section 3.4.1), which we also used for blur kernel (PSF)
estimation (Section 10.1.4). In this case, we have several observed images $\{o_k(x)\}$, as well
as an image warping function $\hat{h}_k(x)$ for each observed image (Figure 3.46). Combining all
of these elements, we get the (noisy) observation equations¹⁷

$$o_k(x) = D\{ b(x) * s(\hat{h}_k(x)) \} + n_k(x), \tag{10.25}$$

where $D$ is the downsampling operator, which operates after the super-resolved (sharp)
warped image $s(\hat{h}_k(x))$ has been convolved with the blur kernel $b(x)$. The above image
formation equations lead to the following least squares problem,

$$\sum_k \| o_k(x) - D\{ b_k(x) * s(\hat{h}_k(x)) \} \|^2. \tag{10.26}$$


In most super-resolution algorithms, the alignment (warping) $\hat{h}_k$ is estimated using one
of the input frames as the reference frame; either feature-based (Section 8.1.3) or direct
(image-based) (Section 9.2) parametric alignment techniques can be used. (A few algorithms, such
as those described by Schultz and Stevenson (1996), Capel (2004), and Wronski, Garcia-Dorado
et al. (2019) use dense (per-pixel flow) estimates.) A better approach is to re-compute
the alignment by directly minimizing (10.26) once an initial estimate of $s(x)$ has been
computed (Hardie, Barnard, and Armstrong 1997) or to marginalize out the motion parameters
altogether (Pickup, Capel et al. 2007).

¹⁷ It is also possible to add an unknown bias-gain term to each observation (Capel 2004), as
was done for motion estimation in (9.8).

The point spread function (blur kernel) $b_k$ is either inferred from knowledge of the image
formation process (e.g., the amount of motion or defocus blur and the camera sensor optics)
or calibrated from a test image or the observed images $\{o_k\}$ using one of the techniques
described in Section 10.1.4. The problem of simultaneously inferring the blur kernel and the
sharp image is known as blind image deconvolution (Kundur and Hatzinakos 1996; Levin
2006; Levin, Weiss et al. 2011; Campisi and Egiazarian 2017).¹⁸

Given an estimate of $\hat{h}_k$ and $b_k(x)$, (10.26) can be re-written using matrix/vector
notation as a large sparse least squares problem in the unknown values of the super-resolved
pixels $s$,

$$\sum_k \| o_k - D B_k W_k s \|^2. \tag{10.27}$$

(Recall from (3.75) that once the warping function $\hat{h}_k$ is known, values of
$s(\hat{h}_k(x))$ depend linearly on those in $s(x)$.) An efficient way to solve this least
squares problem is to use preconditioned conjugate gradient descent (Capel 2004), although some
earlier algorithms, such as the one developed by Irani and Peleg (1991), used regular gradient
descent (also known as iterative back projection (IBP) in the computed tomography literature).
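A toy one-dimensional instance of (10.27) shows how the operators $D$, $B_k$, and $W_k$ fit together and how plain gradient descent back-projects the residuals. The setup is deliberately idealized and is not taken from any of the cited papers: the warps are known integer circular shifts, the blur kernel is known and shared by all frames, there is no noise, and the matrices are dense. All function names are hypothetical.

```python
import numpy as np

def shift_matrix(n, shift):
    """W_k: circular shift by an integer number of high-resolution pixels."""
    return np.roll(np.eye(n), shift, axis=1)

def blur_matrix(n, kernel=(0.2, 0.6, 0.2)):
    """B: circular convolution with a small symmetric kernel."""
    half = len(kernel) // 2
    B = np.zeros((n, n))
    for offset, weight in enumerate(kernel):
        B += weight * np.roll(np.eye(n), offset - half, axis=1)
    return B

def downsample_matrix(n, factor):
    """D: keep every `factor`-th sample."""
    return np.eye(n)[::factor]

def super_resolve(observations, operators, n, step=1.0, iters=500):
    """Gradient descent on sum_k ||o_k - A_k s||^2 with A_k = D B W_k."""
    s = np.zeros(n)
    for _ in range(iters):
        update = np.zeros(n)
        for o_k, A_k in zip(observations, operators):
            update += A_k.T @ (o_k - A_k @ s)  # back-project the residual
        s += step * update
    return s

# Usage: two low-resolution frames offset by one high-resolution pixel.
n, factor = 32, 2
rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 1.0, n)
D, B = downsample_matrix(n, factor), blur_matrix(n)
operators = [D @ B @ shift_matrix(n, k) for k in range(factor)]
observations = [A_k @ truth for A_k in operators]
estimate = super_resolve(observations, operators, n)
```

Because the two shifted frames together sample every high-resolution position and the chosen blur kernel is invertible, the noise-free problem has a unique solution that the iteration recovers. With noise or a less benign kernel, the same iteration amplifies high frequencies, which is the motivation for the image priors discussed next.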

The above formulation assumes that warping can be expressed as a simple (sinc or bicu- 
bic) interpolated resampling of the super-resolved sharp image, followed by a stationary 
(spatially invariant) blurring (PSF) and area integration process. However, if the surface is 
severely foreshortened, we have to take into account the spatially varying filtering that occurs 
during the image warping (Section 3.6.1), before we can then model the PSF induced by the 
optics and camera sensor (Wang, Kang et al. 2001; Capel 2004).

How well does this least squares (MLE) approach to super-resolution work? In practice, 
this depends a lot on the amount of blur and aliasing in the camera optics, as well as the accu- 
racy in the motion and PSF estimates (Baker and Kanade 2002; Jiang, Wong, and Bao 2003; 
Capel 2004). Less blurring and more aliasing means that there is more (aliased) high fre- 
quency information available to be recovered. However, because the least squares (maximum 
likelihood) formulation uses no image prior, a lot of high-frequency noise can be introduced 
into the solution (Figure 10.29c). 

For this reason, classic super-resolution algorithms assume some form of image prior. The
simplest of these is to place a penalty on the image derivatives similar to Equations (4.29) and
(4.42), e.g.,

$$\sum_{(i, j)} \rho_p(s(i, j) - s(i + 1, j)) + \rho_p(s(i, j) - s(i, j + 1)). \tag{10.28}$$

As discussed in Section 4.3, when $\rho_p$ is quadratic, this is a form of Tikhonov
regularization (Section 4.2), and the overall problem is still linear least squares. The
resulting prior image model is a Gaussian Markov random field (GMRF), which can be extended to
other (e.g., diagonal) differences, as in Capel (2004) and Figure 10.29.

¹⁸ Notice that there is a chicken-and-egg problem if both the blur kernel and the super-resolved
image are unknown. This can be “broken” either using structural assumptions about the sharp
image, e.g., the presence of edges (Joshi, Szeliski, and Kriegman 2008) or prior models for the
image, such as edge sparsity (Fergus, Singh et al. 2006).

Figure 10.29 Super-resolution results using a variety of image priors (Capel 2001): (a) Low-res
ROI (bicubic 3× zoom); (b) average image; (c) MLE at 1.25× pixel-zoom; (d) simple ‖x‖² prior
(λ = 0.004); (e) GMRF (λ = 0.003); (f) HMRF (λ = 0.01, α = 0.04). 10 images are used as input
and a 3× super-resolved image is produced in each case, except for the MLE result in (c).

Unfortunately, GMRFs tend to produce solutions with visible ripples, which can also be 
interpreted as increased noise sensitivity in middle frequencies. A better image prior is a 
robust prior that encourages piecewise continuous solutions (Black and Rangarajan 1996), 
see Appendix B.3. Examples of such priors include the Huber potential (Schultz and Steven- 
son 1996; Capel and Zisserman 2003), which is a blend of a Gaussian with a longer-tailed 
Laplacian, and the even sparser (heavier-tailed) hyper-Laplacians used by Levin, Fergus et al. 
(2007) and Krishnan and Fergus (2009). It is also possible to learn the parameters for such 
priors using cross-validation (Capel 2004; Pickup 2007). 
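The difference between a quadratic (GMRF) penalty and a robust one is easy to see numerically. The sketch below evaluates the derivative prior (10.28) with a quadratic potential and with a Huber potential; the particular Huber parametrization (quadratic below a threshold `alpha`, linear above it) and the default `alpha` are assumptions chosen for illustration, since several equivalent scalings are in use.

```python
import numpy as np

def quadratic_potential(d):
    return d ** 2

def huber_potential(d, alpha=0.04):
    """Quadratic for |d| <= alpha, linear beyond; continuous at |d| = alpha."""
    a = np.abs(d)
    return np.where(a <= alpha, d ** 2, 2.0 * alpha * a - alpha ** 2)

def derivative_prior(s, rho):
    """Sum of rho over horizontal and vertical first differences, as in (10.28)."""
    dx = s[:, :-1] - s[:, 1:]
    dy = s[:-1, :] - s[1:, :]
    return rho(dx).sum() + rho(dy).sum()

# Usage: a sharp unit step edge across a 4 x 8 image.
edge = np.zeros((4, 8))
edge[:, 4:] = 1.0
cost_quadratic = derivative_prior(edge, quadratic_potential)
cost_huber = derivative_prior(edge, huber_potential)
```

For small differences the two potentials agree, so smooth regions are regularized identically, but the Huber cost of the step edge is far lower, which is why it tolerates piecewise continuous solutions.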


Figure 10.30 Example-based super-resolution: (a) original 32 × 32 low-resolution image;
(b) example-based super-resolved 256 × 256 image (Freeman, Jones, and Pasztor 2002) © 2002
IEEE; (c) upsampling via imposed edge statistics (Fattal 2007) © 2007 ACM.

While sparse (robust) derivative priors can reduce rippling effects and increase edge
sharpness, they cannot hallucinate higher-frequency texture or details. To do this, a training
set of sample images can be used to find plausible mappings between low-frequency originals and
the missing higher frequencies. Inspired by some of the example-based texture synthesis
algorithms we discuss in Section 10.5, the example-based super-resolution algorithm developed by
Freeman, Jones, and Pasztor (2002) uses training images to learn the mapping between local
texture patches and missing higher-frequency details. To ensure that overlapping patches are
similar in appearance, a Markov random field is used and optimized using either belief
propagation (Freeman, Pasztor, and Carmichael 2000) or a raster-scan deterministic variant
(Freeman, Jones, and Pasztor 2002). Figure 10.30 shows the results of hallucinating missing
details using this approach and compares these results to a more recent algorithm by Fattal
(2007). This latter algorithm learns to predict oriented gradient magnitudes in the finer
resolution image based on a pixel’s location relative to the nearest detected edge along with
the corresponding edge statistics (magnitude and width). It is also possible to combine sparse
(robust) derivative priors with example-based super-resolution, as shown by Tappen, Russell, and
Freeman (2003).

An alternative (but closely related) form of hallucination is to recognize the parts of a
training database of images to which a low-resolution pixel might correspond. In their work,
Baker and Kanade (2002) use local derivative-of-Gaussian filter responses as features and
then match parent structure vectors in a manner similar to De Bonet (1997).¹⁹ The
high-frequency gradient at each recognized training image location is then used as a constraint
on the super-resolved image, along with the usual reconstruction (prediction) Equation (10.26).
Figure 10.31 shows the result of hallucinating higher-resolution faces from lower-resolution
inputs; Baker and Kanade (2002) also show examples of super-resolving known-font text.
Exercise 10.7 gives more details on how to implement and test one or more of these
super-resolution techniques.

¹⁹ For face super-resolution, where all the images are pre-aligned, only corresponding pixels in
different images are examined.

Figure 10.31 Recognition-based super-resolution (Baker and Kanade 2002) © 2002 IEEE.
The Hallucinated column shows the results of the recognition-based algorithm compared to
the regularization-based approach of Hardie, Barnard, and Armstrong (1997).


The latest trend in super-resolution has been the use of deep neural networks to directly
predict super-resolved images. This approach, which began with the seminal work of Dong,
Loy et al. (2016), has generated dozens of different DNNs and architectures, including the
Deep Learning Super Sampling hardware embedded in the latest NVIDIA graphics cards
(Burnes 2020). The recent survey on single-image super-resolution by Anwar, Khan, and
Barnes (2020) categorizes these algorithms into a taxonomy (Figure 10.32a), provides a
pictorial summary of network architectures (Figure 10.32b), and compares the super-resolution
results both numerically and visually on noise-free known bicubic-kernel decimation image
datasets. While the results shown in Figure 10.33 show dramatic differences between
algorithms, it is not clear how well these algorithms generalize to real-world noisy input with
unknown blur kernels. The RealSR real-world super-resolution dataset developed by Cai,
Zeng et al. (2019), shot using a zoom lens on a digital camera, provides a means to test (and
train) algorithms on real imaging degradations. This dataset forms the basis for the NTIRE
challenges on real image super-resolution (Cai, Gu et al. 2019), which provide empirical
comparisons of recent deep network-based algorithms.

Figure 10.32 Recent deep neural network algorithms for single image super-resolution
(Anwar, Khan, and Barnes 2020) © 2020 ACM: (a) a taxonomy of the algorithms based on
their general approach; (b) schematic architectures for a subset of the algorithms.

Figure 10.33 Visual comparison of some super-resolution algorithms (Anwar, Khan, and
Barnes 2020) © 2020 ACM.

Figure 10.34 Timeline of denoising algorithms from Gu and Timofte (2019) © 2019 Springer.


While single-image super-resolution is interesting, much more impressive (and practical) 
results can be obtained by building a multi-frame super-resolution algorithm directly into a 
smartphone camera, where the processing can be done jointly with the image demosaicing. 
We discuss recent work by Wronski, Garcia-Dorado et al. (2019) in Section 10.3.1 and Fig- 
ure 10.38 on color image demosaicing. It is also possible to upsample videos temporally 
using frame interpolation (Section 9.4.1), spatially using video super-resolution (Liu and Sun 
2013; Kappeler, Yoo et al. 2016; Shi, Caballero et al. 2016; Tao, Gao et al. 2017; Nah, Timo- 
fte et al. 2019; Isobe, Jia et al. 2020; Li, Tao et al. 2020), or simultaneously in both the spatial 
and temporal dimensions (Kang, Jo et al. 2020).

Single and multi-frame denoising


Image denoising is one of the classic problems in image processing and computer vision (Per- 
ona and Malik 1990b; Rudin, Osher, and Fatemi 1992; Buades, Coll, and Morel 2005b). Over 
the last four decades, hundreds of algorithms have been developed, and the field continues to 
be actively studied, with recent algorithms all being based on deep neural networks. 

The latest benchmark for comparing image denoising algorithms, the NTIRE 2020 Chal- 
lenge on Real Image Denoising (Abdelhamed, Afifi et al. 2020), is based on a smartphone 
image denoising dataset (SIDD) (Abdelhamed, Lin, and Brown 2018), where the noise-free 
ground truth images were obtained by averaging sets of 150 noisy images. This provides
much more realistic and varied real-world noise and image processing models than the
synthetically noised images used in most previous benchmarks (with the exception of Plötz and
Roth (2017)).
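To see why averaging 150 frames gives a usable reference image, recall that averaging N independent, zero-mean noisy measurements reduces the noise standard deviation by a factor of √N. The short simulation below checks this on synthetic data. It assumes perfectly aligned frames with additive Gaussian noise, which is an idealization: real captures need alignment, and their noise is signal-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(0.2, 0.8, size=(32, 32))   # hypothetical noise-free scene
sigma = 0.1                                    # per-frame noise level (assumed)
frames = clean + rng.normal(0.0, sigma, size=(150, 32, 32))

estimate = frames.mean(axis=0)                 # average of the 150 frames

single_rmse = np.sqrt(np.mean((frames[0] - clean) ** 2))
mean_rmse = np.sqrt(np.mean((estimate - clean) ** 2))
# Theory predicts mean_rmse to be close to sigma / sqrt(150).
```

The same square-root law underlies the burst photography approaches discussed later in this section, where many short, low-gain exposures are merged in place of one long or high-gain exposure.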

A recent (brief) survey on image denoising by Gu and Timofte (2019) includes the following
seminal denoising papers²¹ (see Figure 10.34 for a timeline):

• total variation (TV) (Rudin, Osher, and Fatemi 1992; Chan, Osher, and Shen 2001;
  Chambolle 2004; Chan and Shen 2005),

• Gaussian scale mixtures (GSMs) (Lyu and Simoncelli 2009),

• Field of Experts (FoE) (Roth and Black 2009),

• non-local means (NLM) (Buades, Coll, and Morel 2005a,b),

• BM3D (Dabov, Foi et al. 2007),

• sparse overcomplete dictionaries (K-SVD) (Aharon, Elad, and Bruckstein 2006),

• expected patch log likelihood (EPLL) (Zoran and Weiss 2011),

• an MLP denoiser (Burger, Schuler, and Harmeling 2012),

• weighted nuclear norm minimization (WNNM) (Gu, Zhang et al. 2014),

• shrinkage fields (CSF) (Schmidt and Roth 2014),

• Trainable Nonlinear Reaction Diffusion (TNRD) (Chen and Pock 2016),

• a cross-channel noise model for color images (Nam, Hwang et al. 2016),

• a denoising residual CNN (DnCNN) (Zhang, Zuo et al. 2017), which is now considered
  the baseline for DNN denoising, and

• learning to see in the dark (Chen, Chen et al. 2018).

²⁰ https://data.vision.ee.ethz.ch/cvl/ntire20/, https://data.vision.ee.ethz.ch/cvl/aim20/

²¹ I have added a few more papers from the ICCV tutorial by Brown (2019) and a few additional
recommendations from Abdelrahman Abdelhamed.


While these results show dramatic improvement over time, today’s imaging sensors for 
the most part produce relatively clean images, except in low-light situations, where the ISO 
camera gain must be increased and the read and photon noise become comparable to the sig- 
nal strength. In this regime, it is preferable, if possible, to take a rapid burst of images at 
low ISO (gain) and then combine these to obtain a denoised image (Hasinoff, Kutulakos et 
al. 2009; Hasinoff, Durand, and Freeman 2010; Liu, Yuan et al. 2014). This approach was 
generalized and applied to low-light photography in the HDR+ system of Hasinoff, Sharlet et 
al. (2016). More recent work along these lines, some of which combines low-light photography,
demosaicing, and in some cases super-resolution, includes papers by Godard, Matzen,
and Uyttendaele (2018), Chen, Chen et al. (2018), Mildenhall, Barron et al. (2018),
Wronski, Garcia-Dorado et al. (2019), and Rong, Demandolx et al. (2020). Liba, Murthy et al.
(2019) describe the technology that underlies Google’s Night Sight feature, which not only
robustly aligns and merges different moving regions together under noisy conditions, but also
introduces the concept of “motion metering” to determine the optimal number of frames and
exposure times.


Blur removal 


Under favorable conditions, super-resolution and related upsampling techniques can increase 
the resolution of a well-photographed image or image collection. When the input images are 
blurry to start with, the best one can often hope for is to reduce the amount of blur. This 
problem is closely related to super-resolution, with the biggest differences being that the blur 
kernel b is usually much larger (and unknown) and the downsampling factor D is unity. 

A large literature on image deblurring exists; some publications with nice literature re- 
views include those by Fergus, Singh et al. (2006), Yuan, Sun et al. (2008), and Joshi, Zitnick 
et al. (2009). It is also possible to reduce blur by combining sharp (but noisy) images with
blurrier (but cleaner) images (Yuan, Sun et al. 2007), by taking lots of quick exposures
(Hasinoff and Kutulakos 2011; Hasinoff, Kutulakos et al. 2009; Hasinoff, Durand, and Freeman
2010), or by using coded aperture techniques to simultaneously estimate depth and reduce blur
(Levin, Fergus et al. 2007; Zhou, Lin, and Nayar 2009). When available, data from on-board IMUs
(inertial measurement units) can be used for blur kernel determination (Joshi, Kang et al.
2010). It is also possible to use information from dual-pixel sensors to aid the deblurring of
misfocused images (Abuolaim and Brown 2020).

Figure 10.35 Bayer RGB pattern: (a) color filter array layout; (b) interpolated pixel values,
with unknown (guessed) values shown as lower case.


The past decade has seen the introduction of a large number of new learning-based
deblurring algorithms (Sun, Cao et al. 2015; Schuler, Hirsch et al. 2016; Nah, Hyun Kim, and
Mu Lee 2017; Kupyn, Budzan et al. 2018; Tao, Gao et al. 2018; Zhang, Dai et al. 2019;
Kupyn, Martyniuk et al. 2019). There has also been some work on artificially re-introducing
texture in deblurred images to better match the expected image statistics (Cho, Joshi et al.
2012), i.e., what is now commonly called perceptual loss (Section 5.3.4).


10.3.1 Color image demosaicing 


A special case of super-resolution, which is used daily in most digital still cameras, is the 
process of demosaicing samples from a color filter array (CFA) into a full-color RGB image. 
Figure 10.35 shows the most commonly used CFA known as the Bayer pattern, which has 
twice as many green (G) sensors as red and blue sensors. 
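The sampling pattern, and the simplest way of filling in the missing values, can be sketched as follows. The RGGB phase chosen here is only one of the possible Bayer layouts, and the interpolation shown (averaging the known samples of each channel in a 3 × 3 window, which coincides with bilinear interpolation away from the image border) is the naive baseline rather than any of the edge-aware methods discussed in this section. All function names are hypothetical.

```python
import numpy as np

def bayer_masks(h, w):
    """Boolean sample masks (R, G, B) for an RGGB color filter array."""
    yy, xx = np.mgrid[0:h, 0:w]
    r = (yy % 2 == 0) & (xx % 2 == 0)
    b = (yy % 2 == 1) & (xx % 2 == 1)
    g = ~(r | b)
    return r, g, b

def mosaic(rgb):
    """Simulate the CFA: keep exactly one color sample per pixel."""
    r, g, b = bayer_masks(*rgb.shape[:2])
    return rgb[..., 0] * r + rgb[..., 1] * g + rgb[..., 2] * b

def box3_sum(img):
    """Sum over each 3 x 3 neighborhood, with zero padding at the border."""
    h, w = img.shape
    p = np.pad(img, 1, mode="constant")
    return sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))

def demosaic_simple(cfa):
    """Per channel, average the known samples in each 3 x 3 window
    (normalized convolution) and keep measured values where available."""
    h, w = cfa.shape
    out = np.zeros((h, w, 3))
    for c, m in enumerate(bayer_masks(h, w)):
        m = m.astype(float)
        avg = box3_sum(cfa * m) / box3_sum(m)
        out[..., c] = np.where(m > 0, cfa, avg)
    return out

# Usage: a constant-color image is reconstructed exactly.
rgb = np.ones((6, 6, 3)) * np.array([0.2, 0.5, 0.8])
r_mask, g_mask, b_mask = bayer_masks(6, 6)
restored = demosaic_simple(mosaic(rgb))
```

On constant regions this interpolation is exact; its weaknesses appear at edges, where averaging each channel independently produces the color fringing and zippering artifacts described below.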

The process of going from the known CFA pixel values to the full RGB image is quite
challenging. Unlike regular super-resolution, where small errors in guessing unknown values 
usually show up as blur or aliasing, demosaicing artifacts often produce spurious colors or 
high-frequency patterned zippering, which are quite visible to the eye (Figure 10.36b). 

Over the years, a variety of techniques have been developed for image demosaicing (Kim- 
mel 1999). Longere, Delahunt et al. (2002), Tappen, Russell, and Freeman (2003), and Li, 
Gunturk, and Zhang (2008) provide surveys of the field as well as comparisons of previously 
developed techniques using perceptually motivated metrics. To reduce the zippering effect, 
most techniques use the edge or gradient information from the green channel, which is more 
reliable because it is sampled more densely, to infer plausible values for the red and blue
channels, which are more sparsely sampled.

Figure 10.36 CFA demosaicing results (Bennett, Uyttendaele et al. 2006) © 2006 Springer:
(a) original full-resolution image (a color subsampled version is used as the input to the
algorithms); (b) bilinear interpolation results, showing color fringing near the tip of the blue
crayon and zippering near its left (vertical) edge; (c) the high-quality linear interpolation
results of Malvar, He, and Cutler (2004) (note the strong halo/checkerboard artifacts on the
yellow crayon); (d) using the local two-color prior of Bennett, Uyttendaele et al. (2006).


To reduce color fringing, some techniques perform a color space analysis, e.g., using
median filtering on color opponent channels (Longere, Delahunt et al. 2002). The approach of
Bennett, Uyttendaele et al. (2006) computes local two-color models from an initial demosaicing
result, using a moving 5 × 5 window to find the two dominant colors (Figure 10.37).²²
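A local two-color model of this kind can be approximated with two-means clustering over the colors in a window, followed by projection onto the line joining the two cluster centers. The sketch below is an illustration only and does not reproduce the procedure of Bennett, Uyttendaele et al. (2006): the initialization, the fixed iteration count, and the function name are assumptions.

```python
import numpy as np

def two_color_model(patch, iters=10):
    """Fit two dominant colors to the pixels of a local window.

    Returns (c0, c1, t), where t holds each pixel's position along the line
    from c0 (t = 0) to c1 (t = 1).
    """
    colors = np.asarray(patch, dtype=float).reshape(-1, 3)
    # Initialization (assumed): the color farthest from the mean, then the
    # color farthest from that one.
    mean = colors.mean(axis=0)
    c0 = colors[np.linalg.norm(colors - mean, axis=1).argmax()].copy()
    c1 = colors[np.linalg.norm(colors - c0, axis=1).argmax()].copy()
    for _ in range(iters):
        nearer_c1 = (np.linalg.norm(colors - c1, axis=1)
                     < np.linalg.norm(colors - c0, axis=1))
        if nearer_c1.all() or not nearer_c1.any():
            break  # degenerate window with a single dominant color
        c0 = colors[~nearer_c1].mean(axis=0)
        c1 = colors[nearer_c1].mean(axis=0)
    axis = c1 - c0
    t = (colors - c0) @ axis / max(float(axis @ axis), 1e-12)
    return c0, c1, t

# Usage: a 5 x 5 window containing 12 reddish and 13 bluish pixels.
rng = np.random.default_rng(0)
red, blue = np.array([0.9, 0.1, 0.1]), np.array([0.1, 0.1, 0.9])
window = np.where(np.arange(25)[:, None] < 12, red, blue).reshape(5, 5, 3)
window = window + rng.normal(0.0, 0.01, window.shape)
c0, c1, t = two_color_model(window)
```

In a window that straddles a color edge, the projected coordinates cluster near 0 and 1, matching the peaked distribution described for Figure 10.37.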


Once the local color model has been estimated at each pixel, a Bayesian approach is
then used to encourage pixel values to lie along each color line and to cluster around the
dominant color values, which reduces halos (Figure 10.36d). The Bayesian approach also
supports the simultaneous application of demosaicing, denoising, and super-resolution, i.e.,
multiple CFA inputs can be merged into a higher-quality full-color image. More recent work
that combines demosaicing and denoising includes papers by Chatterjee, Joshi et al. (2011)
and Gharbi, Chaurasia et al. (2016). The NTIRE 2020 Challenge on Real Image Denoising
(Abdelhamed, Afifi et al. 2020) includes a track on denoising RAW (i.e., color filter array)
images. There’s also an interesting paper by Jin, Facciolo, and Morel (2020) studying whether
denoising should be applied before or after demosaicing.

²² Previous work on locally linear color models (Klinker, Shafer, and Kanade 1990; Omer and
Werman 2004) focuses on color and illumination variation within a single material, whereas
Bennett, Uyttendaele et al. (2006) use the two-color model to describe variations across color
(material) edges.

Figure 10.37 Two-color model computed from a collection of local 5 × 5 neighborhoods
(Bennett, Uyttendaele et al. 2006) © 2006 Springer. After two-means clustering and
reprojection along the line joining the two dominant colors (red dots), the majority of the
pixels fall near the fitted line. The distribution along the line, projected along the RGB
axes, is peaked at 0 and 1, the two dominant colors.

As we mentioned before, burst photography (Cohen and Szeliski 2006; Hasinoff, Kutulakos
et al. 2009; Hasinoff and Kutulakos 2011), i.e., the combination of rapidly acquired
sequences of images, is becoming ubiquitous in smartphone cameras. A wonderful example
of a recent system that performs joint demosaicing and multi-frame super-resolution, based
on locally adapted kernel functions (Figure 10.38), is the paper by Wronski, Garcia-Dorado
et al. (2019), which underlies the Super Res Zoom feature in Google’s Pixel smartphones.


10.3.2 Lens blur (bokeh) 


The ability to create a shallow depth-of-field photograph using a large aperture (Section 2.2.3) 
has always been one of the advantages of large-format, e.g., single lens reflex (SLR), cam- 
eras. The desire to artificially simulate refocusable, shallow depth-of-field cameras was one 
of the driving impetuses behind computational photography (Levoy 2006) and led to the de- 
velopment of lightfield cameras (Ng, Levoy et al. 2005), which we discuss in Section 14.3.4. 
Although some commercial models, such as the Lytro, were produced, the ability to create 
such images with smartphone cameras has only recently become widespread.²³

The Apple iPhone 7 Plus with its dual (wide/telephoto) lens was the first smartphone to
introduce this feature, which they called the Portrait mode. Although the technical details
behind this feature have never been published, the algorithm that estimates the depth image
(which can be read out of the metadata in the portrait images) probably uses some combination
of stereo matching and deep learning. A little later, Google released its own Portrait
Mode, which uses the dual pixels, originally designed for focusing the camera optics, along
with person segmentation to compute a depth map, as described in the paper by Wadhwa,
Garg et al. (2018). Once the depth map has been estimated, a fast approximation to a
back-to-front blurred over compositing operator is used to correctly blur the background without
including foreground colors. More recently, Garg, Wadhwa et al. (2019) have improved the
quality of the depth estimation using a deep network, and also used two lenses (along with
dual pixels) to produce even higher-quality depth maps (Zhang, Wadhwa et al. 2020).

²³ An earlier feature called Google Lens Blur, which required moving the camera in a pattern,
https://ai.googleblog.com/2014/04/lens-blur-in-new-google-camera-app.html, was never widely
used.

Figure 10.38 Hand-held multi-frame super-resolution (Wronski, Garcia-Dorado et al.
2019) © 2019 ACM. Processing pipeline, showing: (a) the captured burst of raw (Bayer
CFA) images; (b) local gradients used to compute oriented kernels (c); (d) motion estimates,
combined with local statistics (e) to compute blend weights (f). Results from (i) the previous
method of Hasinoff, Sharlet et al. (2016) and (j) Wronski, Garcia-Dorado et al. (2019).

One final word on bokeh, which is the term photographers use to describe the shape of the 
glints or highlights that appear in an image. This shape is determined by the configuration of 
the aperture blades that control how much light enters the lens (on larger-format cameras). 
Traditionally, these were made with straight metal leaves, which resulted in polygonal aper- 
tures, but they were then mostly replaced by curved leaves to produce a more circular shape. 
When using computational photography, we can use whatever shape is pleasing to the pho- 
tographer, but preferably not a Gaussian blur, which does not correspond to any real aperture 
and produces indistinct highlights. The paper by Wadhwa, Garg et al. (2018) uses a circular 
bokeh for their depth-of-field effect and a more recent version performs the computations in 
the HDR (radiance) space to produce more accurate highlights.²⁴
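To make the difference between aperture shapes concrete, the following minimal sketch blurs a single bright highlight with a disk-shaped kernel and with a Gaussian, working in linear radiance and clipping to the displayable range only at the end. It is not the pipeline of Wadhwa, Garg et al. (2018); the helper names `disk_kernel`, `gaussian_kernel`, and `blur`, the kernel sizes, and the highlight intensity are all illustrative assumptions.

```python
import numpy as np

def disk_kernel(radius):
    """Normalized circular (disk) aperture kernel of the given pixel radius."""
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = (x * x + y * y <= radius * radius).astype(np.float64)
    return k / k.sum()

def gaussian_kernel(sigma, radius):
    """Normalized, truncated Gaussian kernel (shown for comparison only)."""
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = np.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
    return k / k.sum()

def blur(image, kernel):
    """Direct 2D correlation with zero padding (adequate for small demo images)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=np.float64)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0],
                                           dx:dx + image.shape[1]]
    return out

# A dark scene with one very bright point highlight, in linear radiance units
# (values above 1.0 would saturate an ordinary 8-bit image).
radiance = np.full((41, 41), 0.05)
radiance[20, 20] = 500.0

disk = blur(radiance, disk_kernel(6.0))
gauss = blur(radiance, gaussian_kernel(3.0, 9.0))

# Tone map by simple clipping, after the blur has been applied to radiance.
disk_ldr = np.clip(disk, 0.0, 1.0)
gauss_ldr = np.clip(gauss, 0.0, 1.0)

# Count pixels that are neither background nor saturated highlight.
soft_disk = int(np.sum((disk_ldr > 0.1) & (disk_ldr < 0.99)))
soft_gauss = int(np.sum((gauss_ldr > 0.1) & (gauss_ldr < 0.99)))
```

In this toy setting, the disk blur turns the highlight into a flat, sharply bounded spot, whereas the Gaussian leaves a ring of intermediate values around the saturated core, which is the indistinct look the text warns about. Blurring the clipped image instead of the radiance would dim the highlight rather than spread it.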


24 https://ai.googleblog.com/2019/12/improvements-to-portrait-mode-on-google.html

Figure 10.39 Softening a hard segmentation boundary (border matting) (Rother, Kolmogorov,
and Blake 2004) © 2004 ACM: (a) the region surrounding a segmentation boundary
where pixels of mixed foreground and background colors are visible; (b) pixel values along
the boundary are used to compute a soft alpha matte; (c) at each point along the curve t, a
displacement $\Delta$ and a width $\sigma$ are estimated.


10.4 Image matting and compositing 


Image matting and compositing is the process of cutting a foreground object out of one im- 
age and pasting it against a new background (Smith and Blinn 1996; Wang and Cohen 2009). 
It is commonly used in television and film production to composite a live actor in front of 
computer-generated imagery such as weather maps or 3D virtual characters and scenery 
(Wright 2006; Brinkmann 2008), and it has recently become a popular feature in video con- 
ferencing systems. 

We have already seen a number of tools for interactively segmenting objects in an image, 
including snakes (Section 7.3.1), scissors (Section 7.3.1), and GrabCut segmentation (Sec- 
tion 4.3.2). While these techniques can generate reasonable pixel-accurate segmentations, 
they fail to capture the subtle interplay of foreground and background colors at mixed pixels 
along the boundary (Szeliski and Golland 1999) (Figure 10.39a). 

To successfully copy a foreground object from one image to another without visible
discretization artifacts, we need to pull a matte, i.e., to estimate a soft opacity channel $\alpha$ and
the uncontaminated foreground colors F from the input composite image C. Recall from
Section 3.1.3 (Figure 3.4) that the compositing equation (3.8) can be written as

$$C = (1 - \alpha) B + \alpha F. \tag{10.29}$$

This operator attenuates the influence of the background image B by a factor $(1 - \alpha)$ and
then adds in the (partial) color values corresponding to the foreground element F.
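As a minimal sketch of the compositing equation (10.29), the function below applies the over operator to NumPy arrays. The name `composite_over` and the array layout (height × width × 3 colors, with a separate height × width opacity map) are assumptions made for illustration; colors are assumed to be linear rather than gamma-encoded.

```python
import numpy as np

def composite_over(F, alpha, B):
    """Composite foreground F over background B: C = (1 - alpha) B + alpha F.

    F, B:  H x W x 3 arrays of linear colors.
    alpha: H x W array of opacities in [0, 1].
    """
    a = np.asarray(alpha, dtype=np.float64)[..., None]  # broadcast over channels
    return (1.0 - a) * B + a * F

# A 1 x 2 image: a red foreground over a blue background.
F = np.array([[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])
B = np.array([[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
alpha = np.array([[1.0, 0.25]])

C = composite_over(F, alpha, B)
# The opaque pixel stays red; the 25% opaque pixel is mostly background blue.
```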

While the compositing operation is easy to implement, the reverse matting operation of
estimating F, $\alpha$, and B given an input image C is much more challenging (Figure 10.40).

To see why, observe that while the composite pixel color C provides three measurements,
the F, $\alpha$, and B unknowns have a total of seven degrees of freedom (three each for F and B,
plus one for $\alpha$). Devising techniques to estimate these unknowns despite the underconstrained
nature of the problem is the essence of image matting.

Figure 10.40 Natural image matting (Chuang, Curless et al. 2001) © 2001 IEEE: (a) input
image with a “natural” (non-constant) background; (b) hand-drawn trimap, where gray indicates
unknown regions; (c) extracted alpha map; (d) extracted (premultiplied) foreground colors;
(e) composite over a new background.

In this section, we review a number of image matting techniques. We begin with blue 
screen matting, which assumes that the background is a constant known color, and discuss its 
variants, two-screen matting (when multiple backgrounds can be used) and difference matting 
(where the known background is arbitrary). We then discuss local variants of natural image 
matting, where both the foreground and background are unknown. In these applications, it is 
usual to first specify a trimap, i.e., a three-way labeling of the image into foreground, back- 
ground, and unknown regions (Figure 10.40b). Next, we present some global optimization 
approaches to natural image matting. Finally, we discuss variants on the matting problem, 
including shadow matting, flash matting, and environment matting. 


10.4.1 Blue screen matting 


Blue screen matting involves filming an actor (or object) in front of a constant colored
background. While originally bright blue was the preferred color, bright green is now more
commonly used (Wright 2006; Brinkmann 2008). Smith and Blinn (1996) discuss a number of
techniques for blue screen matting, which are mostly described in patents rather than in the
open research literature. Early techniques used linear combinations of object color channels
with user-tuned parameters to estimate the opacity $\alpha$.

Chuang, Curless et al. (2001) describe a newer technique called Mishima’s algorithm,
which involves fitting two polyhedral surfaces (centered at the mean background color),
separating the foreground and background color distributions, and then measuring the relative
distance of a novel color to these surfaces to estimate $\alpha$ (Figure 10.41e). While this technique
works well in many studio settings, it can still suffer from blue spill, where translucent pixels
around the edges of an object acquire some of the background blue coloration.


Two-screen matting. In their paper, Smith and Blinn (1996) also introduce an algorithm
called triangulation matting that uses more than one known background color to over-constrain
the equations required to estimate the opacity $\alpha$ and foreground color F.

For example, consider in the compositing equation (10.29) setting the background color
to black, i.e., B = 0. The resulting composite image C is therefore equal to $\alpha F$. Replacing
the background color with a different known non-zero value B now results in

$$C - \alpha F = (1 - \alpha) B, \tag{10.30}$$

where $\alpha F$ is the composite measured against the black background. This is an overconstrained
set of (color) equations for estimating $\alpha$. In practice, B should
be chosen so as not to saturate C and, for best accuracy, several values of B should be used.
It is also important that colors be linearized before processing, which is the case for all image
matting algorithms. Papers that generate ground truth alpha mattes for evaluation purposes
normally use these techniques to obtain accurate matte estimates (Chuang, Curless et al.
2001; Wang and Cohen 2007a; Levin, Acha, and Lischinski 2008; Rhemann, Rother et al.
2008, 2009).²⁵ Exercise 10.8 has you do this as well.
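The two-background idea can be sketched in a few lines. Subtracting two instances of (10.29) taken with the same foreground over known backgrounds B1 and B2 gives C1 − C2 = (1 − α)(B1 − B2), which is three equations (one per color channel) in the single unknown α and can be solved by least squares. The function name `triangulation_matte`, the small `eps` guard, and the averaging of the two premultiplied foreground estimates are illustrative choices, not details taken from Smith and Blinn (1996).

```python
import numpy as np

def triangulation_matte(C1, B1, C2, B2, eps=1e-8):
    """Estimate alpha and premultiplied foreground from two known backgrounds.

    C1, C2: H x W x 3 composites of the same foreground over B1 and B2.
    B1, B2: H x W x 3 known backgrounds (they must differ at every pixel).
    All colors are assumed to be linear.
    """
    dC = C1 - C2
    dB = B1 - B2
    # Least-squares solution of dC = (1 - alpha) dB over the color channels.
    one_minus_alpha = np.sum(dC * dB, axis=-1) / (np.sum(dB * dB, axis=-1) + eps)
    alpha = np.clip(1.0 - one_minus_alpha, 0.0, 1.0)
    a = alpha[..., None]
    # alpha * F = C - (1 - alpha) B, averaged over the two observations.
    premult_F = 0.5 * ((C1 - (1.0 - a) * B1) + (C2 - (1.0 - a) * B2))
    return alpha, premult_F

# Synthetic check: composite a random foreground over black and over mid-gray.
rng = np.random.default_rng(0)
true_alpha = rng.uniform(0.0, 1.0, size=(4, 5))
true_F = rng.uniform(0.0, 1.0, size=(4, 5, 3))
B1 = np.zeros((4, 5, 3))
B2 = np.full((4, 5, 3), 0.5)
C1 = true_alpha[..., None] * true_F + (1.0 - true_alpha[..., None]) * B1
C2 = true_alpha[..., None] * true_F + (1.0 - true_alpha[..., None]) * B2

alpha, premult_F = triangulation_matte(C1, B1, C2, B2)
```

With noise-free synthetic inputs the recovered matte matches the one used to build the composites; with real photographs, the accuracy depends on how well the colors are linearized and on how different the backgrounds are.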


Difference matting. A related approach when the background is irregular but known is
called difference matting (Wright 2006; Brinkmann 2008). It is most commonly used when
the actor or object is filmed against a static background, e.g., for office video conferencing,
person tracking applications (Toyama, Krumm et al. 1999), or to produce silhouettes for
volumetric 3D reconstruction techniques (Section 12.7.3) (Szeliski 1993; Seitz and Dyer 1997;
Seitz, Curless et al. 2006). It can also be used with a panning camera where the background
is composited from frames where the foreground has been removed using a garbage matte
(Section 10.4.5) (Chuang, Agarwala et al. 2002). Another application is the detection of visual
continuity errors in films, i.e., differences in the background when a shot is re-taken at a
later time (Pickup and Zisserman 2009).

25 See the alpha matting evaluation website at http://alphamatting.com.

In the case where the foreground and background motions can both be specified with 
parametric transforms, high-quality mattes can be extracted using a generalization of triangu- 
lation matting (Wexler, Fitzgibbon, and Zisserman 2002). When frames need to be processed 
independently, however, the results are often of poor quality (Figure 10.42). In such cases, 
using a pair of stereo cameras as input can dramatically improve the quality of the results 
(Criminisi, Cross et al. 2006; Yin, Criminisi et al. 2007). 
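The simplest per-frame form of difference matting can be sketched as a thresholded color difference against the known background. This is only the naive, independent-per-pixel variant, which produces a binary mask rather than a fractional matte; the function name `difference_matte` and the threshold value are assumptions made for illustration.

```python
import numpy as np

def difference_matte(C, B, threshold=0.1):
    """Binary foreground mask: True where frame C differs from background B.

    C, B: H x W x 3 arrays in the same (preferably linear) color space.
    threshold: Euclidean color distance above which a pixel is called foreground.
    """
    distance = np.linalg.norm(C.astype(np.float64) - B.astype(np.float64), axis=-1)
    return distance > threshold

rng = np.random.default_rng(1)
B = rng.uniform(0.0, 1.0, size=(6, 6, 3))       # irregular but known background
C = B + rng.normal(0.0, 0.005, size=B.shape)    # same scene with slight sensor noise
C[2:4, 2:4] = B[2:4, 2:4] + 0.5                 # a foreground object covers four pixels

mask = difference_matte(C, B)
```

Such a mask fails wherever the foreground happens to resemble the background behind it, which is one reason independently processed frames often give poor results.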


10.4.2 Natural image matting 


The most general version of image matting is when nothing is known about the background 
except, perhaps, for a rough segmentation of the scene into foreground, background, and 
unknown regions, which is known as the trimap (Figure 10.40b). Some techniques, however, 
relax this requirement and allow the user to just draw a few strokes or scribbles in the image: 
see Figures 10.45 and 10.46 (Wang and Cohen 2005; Wang, Agrawala, and Cohen 2007; 
Levin, Lischinski, and Weiss 2008; Rhemann, Rother et al. 2008; Rhemann, Rother, and 
Gelautz 2008). Fully automated single image matting results have also been reported (Levin, 
Acha, and Lischinski 2008; Singaraju, Rother, and Rhemann 2009). The survey paper by 
Wang and Cohen (2009) has detailed descriptions and comparisons of all of these techniques, 
a selection of which are described briefly below, while the website http://alphamatting.com 
has up-to-date lists and numerical comparisons of the most recent algorithms. 

A relatively simple algorithm for performing natural image matting is Knockout, as
described by Chuang, Curless et al. (2001) and illustrated in Figure 10.41f. In this algorithm,
the nearest known foreground and background pixels (in image space) are determined and
then blended with neighboring known pixels to produce a per-pixel foreground F and
background B color estimate. The background color is then adjusted so that the measured color C
lies on the line between F and B. Finally, opacity $\alpha$ is estimated on a per-channel basis, and
the three estimates are combined based on per-channel color differences. (This is an
approximation to the least squares solution for $\alpha$.) Figure 10.42 shows that Knockout has problems
when the background consists of more than one dominant local color.

More accurate matting results can be obtained if we treat the foreground and background
colors as distributions sampled over some region (Figure 10.41g–h). Ruzon and Tomasi
(2000) model local color distributions as mixtures of (uncorrelated) Gaussians and compute
these models in strips. They then find the pairing of mixture components F and B that best
describes the observed color C, compute $\alpha$ as the relative distance between these means,
and adjust the estimates of F and B so that they are collinear with C.

Figure 10.41 Image matting algorithms (Chuang, Curless et al. 2001) © 2001 IEEE.
[Panel titles: Mishima, Knockout, Ruzon–Tomasi, Bayesian.]
Mishima’s algorithm models global foreground and background color distribution as polyhedral
surfaces centered around the mean background (blue) color. Knockout uses a local color
estimate of foreground and background for each pixel and computes $\alpha$ along each color axis.
Ruzon and Tomasi’s algorithm locally models foreground and background colors and variances.
Chuang et al.’s Bayesian matting approach computes a MAP estimate of (fractional)
foreground color and opacity given the local foreground and background distributions.


Chuang, Curless et al. (2001) and Hillman, Hannah, and Renshaw (2001) use full 3 × 3
color covariance matrices to model mixtures of correlated Gaussians, and compute estimates 
independently for each pixel. Matte extraction proceeds in strips starting from known color 
values growing into the unknown regions, so that recently computed F and B colors can be 
used in later stages. 

To estimate the most likely value of an unknown pixel’s opacity and (unmixed) foreground
and background colors, Chuang et al. use a fully Bayesian formulation that maximizes

$$P(F, B, \alpha | C) = P(C | F, B, \alpha) P(F) P(B) P(\alpha) / P(C). \tag{10.31}$$

This is equivalent to minimizing the negative log likelihood

$$L(F, B, \alpha | C) = L(C | F, B, \alpha) + L(F) + L(B) + L(\alpha) \tag{10.32}$$

(dropping the L(C) term because it is constant).
Let us examine each of these terms in turn. The first, $L(C | F, B, \alpha)$, is the likelihood that
pixel color C was observed given values for the unknowns $(F, B, \alpha)$. If we assume Gaussian
noise in our observation with variance $\sigma_C^2$, this negative log likelihood (data term) is

$$L(C) = \tfrac{1}{2} \|C - [\alpha F + (1 - \alpha) B]\|^2 / \sigma_C^2, \tag{10.33}$$

as illustrated in Figure 10.41h.

Figure 10.42 Natural image matting results (Chuang, Curless et al. 2001) © 2001 IEEE.
Difference matting and Knockout both perform poorly on this kind of background, while the
newer natural image matting techniques perform well. Chuang et al.’s results are slightly
smoother and closer to the ground truth.

The second term, L(F), corresponds to the likelihood that a particular foreground color
F comes from the Gaussian mixture model. After partitioning the sample foreground colors
into clusters, a weighted mean $\bar{F}$ and covariance $\Sigma_F$ are computed, where the weights are
proportional to a given foreground pixel’s opacity and distance from the unknown pixel.²⁶
The negative log likelihood for each cluster is thus given by

$$L(F) = (F - \bar{F})^T \Sigma_F^{-1} (F - \bar{F}). \tag{10.34}$$


A similar method is used to estimate unknown background color distributions. If the back- 
ground is already known, i.e., for blue screen or difference matting applications, its measured 
color value and variance are used instead. 

An alternative to modeling the foreground and background color distributions as mixtures 
of Gaussians is to keep around the original color samples and to compute the most likely 
pairings that explain the observed color C (Wang and Cohen 2005, 2007a). These techniques 
are described in more detail in (Wang and Cohen 2009). 

In their Bayesian matting paper, Chuang, Curless et al. (2001) assume a constant
(non-informative) distribution for $L(\alpha)$. Follow-on papers assume this distribution to be more
peaked around 0 and 1, or sometimes use Markov random fields (MRFs) to define a global
correlated prior on $P(\alpha)$ (Wang and Cohen 2009).

To compute the most likely estimates for $(F, B, \alpha)$, the Bayesian matting algorithm
alternates between computing (F, B) and $\alpha$, as each of these problems is quadratic and hence can
be solved as a small linear system. When several color clusters are estimated, the most likely
pairing of foreground and background color clusters is used.
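The alternation can be sketched for a single pixel and a single foreground/background cluster pair. With $\alpha$ fixed, setting the gradient of the quadratic objective with respect to F and B to zero gives a 6 × 6 linear system; with F and B fixed and a constant $L(\alpha)$, the best $\alpha$ is the projection of C − B onto F − B. The function name `bayesian_matte_pixel`, the default `sigma_c`, the iteration count, and the starting value of $\alpha$ are assumptions for illustration, and overall constant factors in the objective are treated as absorbed into the covariances.

```python
import numpy as np

def bayesian_matte_pixel(C, F_mean, F_cov, B_mean, B_cov,
                         sigma_c=0.01, n_iters=20, alpha0=0.5):
    """Alternate between solving for (F, B) and for alpha at one pixel.

    Minimizes, up to constant factors (an assumption of this sketch),
        |C - alpha F - (1 - alpha) B|^2 / sigma_c^2
        + (F - F_mean)^T F_cov^{-1} (F - F_mean)
        + (B - B_mean)^T B_cov^{-1} (B - B_mean).
    """
    I = np.eye(3)
    Fi = np.linalg.inv(F_cov)
    Bi = np.linalg.inv(B_cov)
    s2 = sigma_c ** 2
    alpha = alpha0
    for _ in range(n_iters):
        # Step 1: alpha fixed, so (F, B) solve a 6 x 6 linear system.
        A = np.block([
            [Fi + I * alpha ** 2 / s2, I * alpha * (1.0 - alpha) / s2],
            [I * alpha * (1.0 - alpha) / s2, Bi + I * (1.0 - alpha) ** 2 / s2],
        ])
        b = np.concatenate([
            Fi @ F_mean + C * alpha / s2,
            Bi @ B_mean + C * (1.0 - alpha) / s2,
        ])
        x = np.linalg.solve(A, b)
        F, B = x[:3], x[3:]
        # Step 2: (F, B) fixed, so alpha is the projection of C - B onto F - B.
        d = F - B
        alpha = float(np.clip(np.dot(C - B, d) / (np.dot(d, d) + 1e-12), 0.0, 1.0))
    return F, B, alpha

# Synthetic pixel: 30% of a reddish foreground over a bluish background.
F_mean = np.array([1.0, 0.0, 0.0])
B_mean = np.array([0.0, 0.0, 1.0])
cov = 1e-4 * np.eye(3)
C = 0.3 * F_mean + 0.7 * B_mean

F_est, B_est, alpha_est = bayesian_matte_pixel(C, F_mean, cov, B_mean, cov)
```

A fuller implementation would run this for every pairing of foreground and background clusters and keep the pairing with the lowest final cost.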

Bayesian image matting produces results that improve on the original natural image matting
algorithm by Ruzon and Tomasi (2000), as can be seen in Figure 10.42. However, compared
to later techniques (Wang and Cohen 2009), its performance is not as good for complex
backgrounds or inaccurate trimaps (Figure 10.44).


10.4.3 Optimization-based matting 


An alternative to estimating each pixel’s opacity and foreground color independently is to use
global optimization to compute a matte that takes into account correlations between neighboring
$\alpha$ values. Two examples of this are border matting in the GrabCut interactive segmentation
system (Rother, Kolmogorov, and Blake 2004) and Poisson Matting (Sun, Jia et al.
2004).

26 Note that in this whole chapter, we mostly use upper-case italics to denote images or pixel values, even when
they are color vectors. The covariance $\Sigma_F$ is a 3 × 3 matrix for each foreground cluster.

Figure 10.43 Color line matting (Levin, Lischinski, and Weiss 2008): (a) local 3 × 3 patch
of colors; (b) potential assignment of $\alpha$ values; (c) foreground and background color lines,
the vector $a_k$ joining their closest points of intersection, and the family of parallel planes of
constant $\alpha$ values, $\alpha_i = a_k \cdot (C_i - B_0)$; (d) a scatter plot of sample colors and the deviations
from the mean $\mu_k$ for two sample colors $C_i$ and $C_j$.

Border matting first dilates the region around the binary segmentation produced by GrabCut
(Section 4.3.2) and then solves for a sub-pixel boundary location $\Delta$ and a blur width $\sigma$
for every point along the boundary (Figure 10.39). Smoothness in these parameters along the
boundary is enforced using regularization and the optimization is performed using dynamic
programming. While this technique can obtain good results for smooth boundaries, such as a
person’s face, it has difficulty with fine details, such as hair.

Poisson matting (Sun, Jia et al. 2004) assumes a known foreground and background color
for each pixel in the trimap (as with Bayesian matting). However, instead of independently
estimating each $\alpha$ value, it assumes that the gradient of the alpha matte and the gradient of
the color image are related by

$$\nabla \alpha = \frac{F - B}{\|F - B\|^2} \cdot \nabla C, \tag{10.35}$$

which can be derived by taking gradients of both sides of (10.29) and assuming that the
foreground and background vary slowly. The per-pixel gradient estimates are then integrated
into a continuous $\alpha(x)$ field using the regularization (least squares) technique first described
in Section 4.2 (4.24) and subsequently used in Poisson blending (Section 8.4.4, Equation
(8.75)) and gradient-based dynamic range compression mapping (Section 10.2.1, Equation
(10.18)). This technique works well when good foreground and background color estimates
are available and these colors vary slowly.
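The derivation of (10.35) is short enough to write out. Taking the gradient of the compositing equation (10.29) and dropping the terms that involve the gradients of F and B (the slowly varying assumption) leaves one equation per color channel in the unknown gradient of $\alpha$; the final line is the least squares solution over the three channels, which is how the dot product with F − B is interpreted in this sketch.

```latex
\begin{aligned}
C &= \alpha F + (1 - \alpha) B \\
\nabla C &= (F - B)\,\nabla\alpha + \alpha\,\nabla F + (1 - \alpha)\,\nabla B \\
         &\approx (F - B)\,\nabla\alpha
            \qquad \text{(assuming } \nabla F \approx 0,\ \nabla B \approx 0\text{)} \\
\nabla\alpha &\approx \frac{\sum_{c} (F_c - B_c)\,\nabla C_c}{\sum_{c} (F_c - B_c)^2}
            = \frac{F - B}{\|F - B\|^2} \cdot \nabla C ,
            \qquad c \in \{R, G, B\}
\end{aligned}
```

The denominator makes the failure mode visible: where the local foreground and background colors are nearly equal, $\|F - B\|^2$ is small and the gradient estimate becomes unreliable.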
Instead of computing per-pixel foreground and background colors, Levin, Lischinski, and
Weiss (2008) assume only that these color distributions can locally be well approximated as
mixtures of two colors, which is known as the color line model (Figure 10.43a–c). Under this
assumption, a closed-form estimate for $\alpha$ at each pixel $i$ in a (say, 3 × 3) window $W_k$ is given
by

$$\alpha_i = a_k \cdot (C_i - B_0) = a_k \cdot C_i + b_k, \tag{10.36}$$

where $C_i$ is the pixel color treated as a three-vector, $B_0$ is any pixel along the background
color line, and $a_k$ is the vector joining the two closest points on the foreground and background
color lines, as shown in Figure 10.43c. (Note that the geometric derivation shown
in this figure is an alternative to the algebraic derivation presented by Levin, Lischinski, and
Weiss (2008).) Minimizing the deviations of the alpha values $\alpha_i$ from their respective color
line models (10.36) over all overlapping windows $W_k$ in the image gives rise to the cost

$$E_\alpha = \sum_k \left( \sum_{i \in W_k} (\alpha_i - a_k \cdot C_i - b_k)^2 + \epsilon \|a_k\|^2 \right), \tag{10.37}$$

where the $\epsilon$ term is used to regularize the value of $a_k$ in the case where the two color
distributions overlap (i.e., in constant $\alpha$ regions).
Because this formula is quadratic in the unknowns $(a_k, b_k)$, they can be eliminated
inside each window $W_k$, leading to a final energy

$$E_\alpha = \boldsymbol{\alpha}^T L \boldsymbol{\alpha}, \tag{10.38}$$

where the entries in the L matrix are given by

$$L_{ij} = \sum_{k : i \in W_k \wedge j \in W_k} \left( \delta_{ij} - \frac{1}{M} \left( 1 + (C_i - \mu_k)^T \hat{\Sigma}_k^{-1} (C_j - \mu_k) \right) \right), \tag{10.39}$$

where $M = |W_k|$ is the number of pixels in each (overlapping) window, $\mu_k$ is the mean color
of the pixels in window $W_k$, and $\hat{\Sigma}_k$ is the 3 × 3 covariance of the pixel colors plus $(\epsilon / M) I$.

Figure 10.43d shows the intuition behind the entries in this affinity matrix, which is called
the matting Laplacian. Note how when two pixels $C_i$ and $C_j$ in $W_k$ point in opposite directions
away from the mean $\mu_k$, their weighted dot product is close to −1, and so their affinity
becomes close to 0. Pixels close to each other in color space (and hence with similar expected
$\alpha$ values) will have affinities close to −2/M.

Minimizing the quadratic energy (10.38) constrained by the known values of $\alpha \in \{0, 1\}$
at scribbles only requires the solution of a sparse set of linear equations, which is why the
authors call their technique a closed-form solution to natural image matting. Once $\alpha$ has
been computed, the foreground and background colors are estimated using a least squares
minimization of the compositing equation (10.29) regularized with a spatially varying
first-order smoothness,

$$E_{B,F} = \sum_i \|C_i - [\alpha_i F_i + (1 - \alpha_i) B_i]\|^2 + \lambda |\nabla \alpha_i| \left( \|\nabla F_i\|^2 + \|\nabla B_i\|^2 \right), \tag{10.40}$$

where the $|\nabla \alpha_i|$ weight is applied separately for the x and y components of the F and B
derivatives (Levin, Lischinski, and Weiss 2008).

Figure 10.44 Comparative matting results for a medium accuracy trimap. Wang and Cohen
(2009) describe the individual techniques being compared. [Legible panel titles: Random walk,
EasyMatting, Closed-form.]

Figure 10.45 Comparative matting results with scribble-based inputs. Wang and Cohen
(2009) describe the individual techniques being compared. [Panel titles: Original & User Input,
Closed-form, Robust, Geodesic, Geodesic+Robust, Ground-truth.]
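The matting Laplacian (10.39) and the linear solve for $\alpha$ can be sketched for a tiny image using dense matrices. This is only an illustration: a practical implementation would use sparse matrices, and the scribble constraints are imposed here as soft penalties with a large weight, which is a substitution for the hard constraints described in the text. The function names `matting_laplacian` and `solve_alpha`, the value of `eps`, and the penalty `weight` are assumptions.

```python
import numpy as np

def matting_laplacian(image, eps=1e-5, r=1):
    """Dense matting Laplacian of an H x W x 3 image, windows of (2r+1)^2 pixels."""
    H, W, _ = image.shape
    N = H * W
    M = (2 * r + 1) ** 2
    idx = np.arange(N).reshape(H, W)
    L = np.zeros((N, N))
    for y in range(r, H - r):
        for x in range(r, W - r):
            win = idx[y - r:y + r + 1, x - r:x + r + 1].ravel()
            colors = image[y - r:y + r + 1, x - r:x + r + 1].reshape(M, 3)
            mu = colors.mean(axis=0)
            d = colors - mu                                   # deviations C_i - mu_k
            cov = d.T @ d / M + (eps / M) * np.eye(3)         # regularized covariance
            affinity = (1.0 + d @ np.linalg.solve(cov, d.T)) / M
            L[np.ix_(win, win)] += np.eye(M) - affinity       # delta_ij - affinity
    return L

def solve_alpha(L, known_mask, known_values, weight=100.0):
    """Minimize alpha^T L alpha + weight * sum over known pixels of (alpha - value)^2."""
    D = np.diag(weight * known_mask.ravel().astype(np.float64))
    rhs = weight * (known_mask * known_values).ravel()
    return np.linalg.solve(L + D, rhs)

# Toy image: left half red, right half blue, with one scribbled pixel per side.
H, W = 6, 6
image = np.zeros((H, W, 3))
image[:, :3] = [1.0, 0.0, 0.0]
image[:, 3:] = [0.0, 0.0, 1.0]

known = np.zeros((H, W), dtype=bool)
values = np.zeros((H, W))
known[2, 0], values[2, 0] = True, 1.0    # foreground scribble
known[3, 5], values[3, 5] = True, 0.0    # background scribble

L = matting_laplacian(image)
alpha = solve_alpha(L, known, values).reshape(H, W)
```

Each row of L sums to zero, so a constant matte has zero energy, and in this toy example the two scribbled values spread across their respective color regions.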

Laplacian (closed-form) matting is just one of many optimization-based techniques surveyed
and compared by Wang and Cohen (2009). Some of these techniques use alternative
formulations for the affinities or smoothness terms on the $\alpha$ matte, alternative estimation
techniques such as belief propagation, or alternative representations (e.g., local histograms)
for modeling local foreground and background color distributions (Wang and Cohen 2005,
2007a,b). Some of these techniques also provide real-time results as the user draws a contour
line or sparse set of scribbles (Wang, Agrawala, and Cohen 2007; Rhemann, Rother et al.
2008) or even pre-segment the image into a small number of mattes that the user can select
with simple clicks (Levin, Acha, and Lischinski 2008).

Figure 10.46 Stroke-based segmentation result (Rhemann, Rother et al. 2008) © 2008
IEEE.

Figure 10.44 shows the results of running a number of the surveyed algorithms on a 
region of toy animal fur where a trimap has been specified, while Figure 10.45 shows results 
for techniques that can produce mattes with only a few scribbles as input. Figure 10.46 
shows a result for an even more recent algorithm (Rhemann, Rother et al. 2008) that claims 
to outperform all of the techniques surveyed by Wang and Cohen (2009). 

The latest results on natural image matting can be found on the http://alphamatting.com 
website created by Rhemann, Rother et al. (2009). It currently lists over 60 different algo- 
rithms, with most of the more recent algorithms using deep neural networks. The Deep Image 
Matting paper by Xu, Price et al. (2017) provides a larger database of 49,300 training images 
and 1,000 test images constructed by overlaying manually created color foreground mattes
over a variety of backgrounds.²⁷


Pasting. Once a matte has been pulled from an image, it is usually composited directly 
over the new background, unless the seams between the cutout and background regions are 
to be hidden, in which case Poisson blending (Pérez, Gangnet, and Blake 2003) can be used 
(Section 8.4.4). 

In the latter case, it is helpful if the matte boundary passes through regions that either 
have little texture or look similar in the old and new images. Papers by Jia, Sun et al. (2006) 
and Wang and Cohen (2007b) explain how to do this. 


27 https://sites.google.com/view/deepimagematting

Figure 10.47 Smoke matting (Chuang, Agarwala et al. 2002) © 2002 ACM: (a) input video
frame; (b) after removing the foreground object; (c) estimated alpha matte; (d) insertion of
new objects into the background.


10.4.4 Smoke, shadow, and flash matting 


In addition to matting out solid objects with fractional boundaries, it is also possible to matte 
out translucent media such as smoke (Chuang, Agarwala et al. 2002). Starting with a video 
sequence, each pixel is modeled as a linear combination of its (unknown) background color 
and a constant foreground (smoke) color that is common to all pixels. Voting in color space 
is used to estimate this foreground color and the distance along each color line is used to 
estimate the per-pixel temporally varying alpha (Figure 10.47). 

Extracting and re-inserting shadows is also possible using a related technique (Chuang, 
Goldman et al. 2003; Wang, Curless, and Seitz 2020). Here, instead of assuming a constant 
foreground color, each pixel is assumed to vary between its fully lit and fully shadowed col- 
ors, which can be estimated by taking (robust) minimum and maximum values over time as 
a shadow passes over the scene (Exercise 10.9). The resulting fractional shadow matte can 
be used to re-project the shadow into a new scene. If the destination scene has a non-planar 
geometry, it can be scanned by waving a straight stick shadow across the scene. The new 
shadow matte can then be warped with the computed deformation field to have it drape cor- 
rectly over the new scene (Figure 10.48). Shadows can also be extracted from video streams 
by extending video object segmentation algorithms (Section 9.4.3) to include shadows and 
other effects such as smoke (Lu, Cole et al. 2021). An example of useful shadow manipula- 
tion in photographs is the removal or softening of harsh shadows in people’s portraits (Sun, 
Barron et al. 2019; Zhou, Hadap et al. 2019; Zhang, Barron et al. 2020), which is available 
as the Portrait Light feature in Google Photos.²⁸
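The lit and shadowed color estimation described above can be sketched as follows. Robust minimum and maximum values are taken as low and high percentiles over time, and each frame is then expressed as a fraction between the two. The function name `shadow_matte`, the percentile choices, and the convention that 0 means fully lit and 1 means fully shadowed are assumptions of this sketch rather than the notation of Chuang, Goldman et al. (2003).

```python
import numpy as np

def shadow_matte(frames, lo_pct=5, hi_pct=95, eps=1e-6):
    """Estimate lit/shadowed images and a per-frame fractional shadow matte.

    frames: T x H x W x 3 linear-intensity video of a shadow sweeping over
            a static scene seen by a static camera.
    Returns (lit, shadowed, beta) with beta of shape T x H x W,
    where 0 means fully lit and 1 means fully shadowed.
    """
    shadowed = np.percentile(frames, lo_pct, axis=0)   # robust per-pixel minimum
    lit = np.percentile(frames, hi_pct, axis=0)        # robust per-pixel maximum
    span = lit - shadowed
    # Project each observation onto the line from the lit to the shadowed color.
    beta = np.sum((lit - frames) * span, axis=-1) / (np.sum(span * span, axis=-1) + eps)
    return lit, shadowed, np.clip(beta, 0.0, 1.0)

# Synthetic sequence: the shadow fades in over 20 frames and dims the scene to 30%.
T, H, W = 20, 2, 2
lit_color = np.array([0.8, 0.6, 0.4])
fractions = np.concatenate([np.zeros(4), np.linspace(0.0, 1.0, 12), np.ones(4)])
frames = (1.0 - 0.7 * fractions)[:, None, None, None] * lit_color * np.ones((T, H, W, 3))

lit, shadowed, beta = shadow_matte(frames)
```

The percentiles only recover the true extremes if each pixel is fully lit and fully shadowed in enough frames, so a real capture needs the shadow to sweep completely across the scene.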

The quality and reliability of matting algorithms can also be enhanced using more sophisticated
acquisition systems. For example, taking a flash and non-flash image pair supports
the reliable extraction of foreground mattes, which show up as regions of large illumination
change between the two images (Sun, Li et al. 2006). Taking simultaneous video streams
focused at different distances (McGuire, Matusik et al. 2005) or using multi-camera arrays
(Joshi, Matusik, and Avidan 2006) are also good approaches to producing high-quality mattes.
These techniques are described in more detail by Wang and Cohen (2009).

28 https://blog.google/products/photos/new-helpful-editor

Figure 10.48 Shadow matting (Chuang, Goldman et al. 2003) © 2003 ACM: (a) foreground
scene; (b) background scene; (c) blue screen composite; (d) the authors’ method; (e) reference
photograph. Instead of simply darkening the new scene with the shadow (c), shadow matting
correctly dims the lit scene with the new shadow and drapes the shadow over 3D geometry (d).

Lastly, photographing a refractive object in front of a number of patterned backgrounds al- 
lows the object to be placed in novel 3D environments. These environment matting techniques 
(Zongker, Werner et al. 1999; Chuang, Zongker et al. 2000) are discussed in Section 14.4. 


10.4.5 Video matting 


While regular single-frame matting techniques such as blue or green screen matting (Smith 
and Blinn 1996; Wright 2006; Brinkmann 2008) can be applied to video sequences, the pres- 
ence of moving objects can sometimes make the matting process easier, as portions of the 
background may get revealed in preceding or subsequent frames. 

Chuang, Agarwala et al. (2002) describe a nice approach to this video matting problem, 
where foreground objects are first removed using a conservative garbage matte and the re- 
sulting background plates are aligned and composited to yield a high-quality background 
estimate. They also describe how trimaps drawn at sparse keyframes can be interpolated to 
in-between frames using bidirectional optical flow. Alternative approaches to video matting,
such as rotoscoping, which involves drawing curves or strokes in video sequence keyframes
(Agarwala, Hertzmann et al. 2004; Wang, Bhat et al. 2005), are discussed in the matting
survey paper by Wang and Cohen (2009). There is also a newer dataset of carefully matted
stop-motion animation videos created by Erofeev, Gitman et al. (2015).²⁹

Since the original development of video matting techniques, improved algorithms have
been developed for both interactive and fully automated video object segmentation, as discussed
in Section 9.4.3. The paper by Sengupta, Jayaram et al. (2020) uses deep learning and
adversarial loss, as well as a motion prior, to provide high-quality mattes from small-motion
handheld videos where a clean plate of the background has also been captured. Wang, Curless,
and Seitz (2020) describe a system where shadows and occlusions can be determined
by observing people walking around a scene, enabling the insertion of new people at correct
scales and lighting. In follow-up work, Lin, Ryabtsev et al. (2021) describe a high-resolution
real-time video matting system along with two new video and image matting datasets. Finally,
Lu, Cole et al. (2021) describe how to extract shadows, reflections, and other effects
associated with objects being tracked and segmented in videos.

29 https://videomatting.com

Figure 10.49 Texture synthesis: (a) given a small patch of texture, the task is to synthesize
(b) a similar-looking larger patch; (c) other semi-structured textures that are challenging to
synthesize. (Images courtesy of Alyosha Efros.)


10.5 Texture analysis and synthesis 


While texture analysis and synthesis may not at first seem like computational photography 
techniques, they are, in fact, widely used to repair defects, such as small holes, in images or 
to create non-photorealistic painterly renderings from regular photographs. 

The problem of texture synthesis can be formulated as follows: given a small sample of 
a “texture” (Figure 10.49a), generate a larger similar-looking image (Figure 10.49b). As you 
can imagine, for certain sample textures, this problem can be quite challenging. 

Traditional approaches to texture analysis and synthesis try to match the spectrum of the
source image while generating shaped noise. Matching the frequency characteristics, which
is equivalent to matching spatial correlations, is in itself not sufficient. The distributions of
the responses at different frequencies must also match. Heeger and Bergen (1995) develop an
algorithm that alternates between matching the histograms of multi-scale (steerable pyramid)
responses and matching the final image histogram. Portilla and Simoncelli (2000) improve
on this technique by also matching pairwise statistics across scales and orientations. De Bonet
(1997) uses a coarse-to-fine strategy to find locations in the source texture with a similar
parent structure, i.e., similar multi-scale oriented filter responses, and then randomly chooses
one of these matching locations as the current sample value. Gatys, Ecker, and Bethge (2015)
also use a pyramidal fine-to-coarse-to-fine algorithm, but using deep networks trained for
object recognition. At each level in the deep network, they gather correlation statistics between
various features. During generation, they iteratively update the random image until these
more perceptually motivated statistics (Zhang, Isola et al. 2018) are matched. We give more
details on this and other neural approaches to texture synthesis, such as Shaham, Dekel, and
Michaeli (2019), in Section 10.5.3 on neural style transfer.
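The histogram-matching step that such statistics-based methods repeat can be illustrated with a
short sketch. The code below is a minimal NumPy example, assuming grayscale floating-point
arrays; the function name `match_histogram` is hypothetical, and the sketch covers only the
matching of one band (or of the final image), not the full steerable-pyramid loop.

```python
import numpy as np

def match_histogram(image, reference):
    """Remap `image` so its empirical value distribution matches `reference`.

    Pixel ranks are preserved: the k-th smallest pixel of `image` receives
    the value found at the same quantile of `reference`.
    """
    flat = image.ravel()
    # Rank order of the image pixels (stable sort keeps ties in scan order).
    order = np.argsort(flat, kind="stable")
    ref_sorted = np.sort(reference.ravel())
    # Sample the sorted reference values at evenly spaced quantiles.
    idx = np.linspace(0, ref_sorted.size - 1, flat.size).round().astype(int)
    matched = np.empty_like(flat, dtype=ref_sorted.dtype)
    matched[order] = ref_sorted[idx]
    return matched.reshape(image.shape)
```

In a Heeger and Bergen style loop, a routine like this would be applied in turn to each pyramid
band of a noise image and then to the reconstructed image, for a few iterations.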


Exemplar-based texture synthesis algorithms sequentially generate texture pixels by looking
for neighborhoods in the source texture that are similar to the currently synthesized image
(Efros and Leung 1999). Consider the (as yet) unknown pixel p in the partially constructed
texture on the left side of Figure 10.50. As some of its neighboring pixels have already
been synthesized, we can look for similar partial neighborhoods in the sample texture image
on the right and randomly select one of these as the new value of p. This process can be
repeated down the new image either in a raster fashion or by scanning around the periphery
(“onion peeling”) when filling holes, as discussed in Section 10.5.1. In their actual
implementation, Efros and Leung (1999) find the most similar neighborhood and then include all
other neighborhoods whose distance is within a factor $d = (1 + \epsilon)$ of the best match,
with $\epsilon = 0.1$. They also optionally weight the random pixel selections by the similarity
metric d.
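The candidate selection just described can be sketched as follows. This is a minimal NumPy
illustration under simplifying assumptions: the texture is grayscale, the distance is a plain
(unweighted) sum of squared differences over the already-synthesized pixels, and the function
name `sample_pixel` and its arguments are hypothetical rather than taken from the original
implementation.

```python
import numpy as np

def sample_pixel(source, neighborhood, known, eps=0.1, rng=None):
    """Choose a value for the center pixel of `neighborhood` from `source`.

    source:       2D grayscale texture sample.
    neighborhood: (w, w) window around the unknown pixel, w odd.
    known:        (w, w) boolean mask, True where pixels are already synthesized.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = neighborhood.shape[0]
    # Every w-by-w window of the source, shape (H - w + 1, W - w + 1, w, w).
    windows = np.lib.stride_tricks.sliding_window_view(source, (w, w))
    # Squared differences, counted only over the known pixels.
    diff = (windows - neighborhood) * known
    dist = (diff ** 2).sum(axis=(2, 3)) / known.sum()
    # Keep all windows within a factor (1 + eps) of the best match.
    rows, cols = np.nonzero(dist <= (1.0 + eps) * dist.min())
    k = rng.integers(len(rows))
    # Return the center pixel of the randomly chosen candidate window.
    return source[rows[k] + w // 2, cols[k] + w // 2]
```

Calling this once per unknown pixel, in raster or onion-peel order, gives a slow but complete
synthesis loop; the brute-force search over all windows is what the acceleration techniques
discussed next aim to avoid.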


To accelerate this process and improve its visual quality, Wei and Levoy (2000) extend
this technique using a coarse-to-fine generation process, where coarser levels of the pyramid,
which have already been synthesized, are also considered during the matching (De Bonet
1997). To accelerate the nearest neighbor finding, tree-structured vector quantization is used.
A much faster version of such nearest neighbor search is the widely used randomized
PatchMatch iterative update algorithm developed by Barnes, Shechtman et al. (2009).


Figure 10.50 Texture synthesis using non-parametric sampling (Efros and Leung 1999).
The value of the newest pixel p is randomly chosen from similar local (partial) patches in the
source texture (input image). (Figure courtesy of Alyosha Efros.)
[Panel labels: Output image (left), Input image (right).]

Figure 10.51 Texture synthesis by image quilting (Efros and Freeman 2001). Instead of
generating a single pixel at a time, larger blocks are copied from the source texture. The
transitions in the overlap regions between the selected blocks are then optimized using
dynamic programming. (Figure courtesy of Alyosha Efros.)

Efros and Freeman (2001) propose an alternative acceleration and visual quality improvement
technique. Instead of synthesizing a single pixel at a time, overlapping square blocks are
selected using similarity with previously synthesized regions (Figure 10.51). Once the
appropriate blocks have been selected, the seam between newly overlapping blocks is determined
using dynamic programming. (Full graph cut seam selection is not required, because only
one seam location per row is needed for a vertical boundary.) Because this process involves
selecting small patches and then stitching them together, Efros and Freeman (2001) call their
system image quilting. Komodakis and Tziritas (2007) present an MRF-based version of this
block synthesis algorithm that uses a new, efficient version of loopy belief propagation they
call “Priority-BP”. Wei, Lefebvre et al. (2009) present a comprehensive survey of work in
exemplar-based texture synthesis through 2009.
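The dynamic-programming seam search for a vertical overlap can be sketched in a few lines.
The following NumPy example is an illustration rather than the authors' code: the function name
`min_error_vertical_seam` is hypothetical, and the per-pixel error is assumed to be the squared
difference between the two overlapping strips.

```python
import numpy as np

def min_error_vertical_seam(overlap_a, overlap_b):
    """Return, for each row, the column at which the seam crosses the overlap.

    overlap_a, overlap_b: the two overlapping strips, shape (rows, cols)
    or (rows, cols, channels).
    """
    err = (np.asarray(overlap_a, float) - np.asarray(overlap_b, float)) ** 2
    if err.ndim == 3:                      # sum over color channels
        err = err.sum(axis=2)
    rows, cols = err.shape
    # Forward pass: cheapest cumulative cost of a path ending at each pixel,
    # where a path may move at most one column between consecutive rows.
    cost = err.copy()
    for r in range(1, rows):
        prev = cost[r - 1]
        left = np.concatenate(([np.inf], prev[:-1]))
        right = np.concatenate((prev[1:], [np.inf]))
        cost[r] += np.minimum(prev, np.minimum(left, right))
    # Backward pass: trace the cheapest path from the last row upward.
    seam = np.empty(rows, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for r in range(rows - 2, -1, -1):
        c = seam[r + 1]
        lo, hi = max(c - 1, 0), min(c + 2, cols)
        seam[r] = lo + int(np.argmin(cost[r, lo:hi]))
    return seam
```

Pixels to the left of the seam would then be taken from the existing block and pixels to the
right from the new one; a horizontal overlap can be handled by transposing the strips.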


10.5.1 Application: Hole filling and inpainting 


Filling holes left behind when objects or defects are excised from photographs, which is
known as inpainting, is one of the most common applications of texture synthesis. Such
techniques are used not only to remove unwanted people or interlopers from photographs
(King 1997) but also to fix small defects in old photos and movies (scratch removal) or to
remove wires holding props or actors in mid-air during filming (wire removal). Bertalmio,
Sapiro et al. (2000) solve the problem by propagating pixel values along isophote
(constant-value) directions interleaved with some anisotropic diffusion steps (Figure 10.52a-b).
Telea (2004) develops a faster technique that uses the fast marching method from level sets
(Section 7.3.2). However, these techniques will not hallucinate texture in the missing regions.
Bertalmio, Vese et al. (2003) augment their earlier technique by adding synthetic texture to
the infilled regions.

Figure 10.52 Image inpainting (hole filling): (a-b) propagation along isophote directions
(Bertalmio, Sapiro et al. 2000) © 2000 ACM; (c-d) exemplar-based inpainting with
confidence-based filling order (Criminisi, Pérez, and Toyama 2004).

The example-based (non-parametric) texture generation techniques discussed in the pre- 
vious section can also be used by filling the holes from the outside in (the “onion-peel” or- 
dering). However, this approach may fail to propagate strong oriented structures. Criminisi, 
Pérez, and Toyama (2004) use exemplar-based texture synthesis where the order of synthesis 
is determined by the strength of the gradient along the region boundary (Figures 10.1d and 
10.52c-d). Sun, Yuan et al. (2004) present a related approach where the user interactively
draws lines to indicate where structures should be preferentially propagated. Additional
techniques related to these approaches include those developed by Drori, Cohen-Or, and Yeshurun
(2003), Kwatra, Schödl et al. (2003), Kwatra, Essa et al. (2005), Wilczkowiak, Brostow et al.
(2005), Komodakis and Tziritas (2007), and Wexler, Shechtman, and Irani (2007).
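To make the “onion-peel” ordering concrete, here is a small sketch that labels each hole pixel
with the pass in which it would be filled when working from the outside in. It is only an
illustration of the fill order, assuming a boolean hole mask and 4-connected neighbors; the
function name `onion_peel_layers` is hypothetical, and the gradient- and confidence-based
priority of Criminisi, Pérez, and Toyama (2004) is not implemented here.

```python
import numpy as np

def onion_peel_layers(hole):
    """Label hole pixels with their fill pass (1 = outermost ring, 0 = not in hole)."""
    remaining = np.asarray(hole, dtype=bool).copy()
    layers = np.zeros(remaining.shape, dtype=int)
    level = 0
    while remaining.any():
        level += 1
        # Known (already filled) pixels, padded so the image border counts as unknown.
        filled = np.pad(~remaining, 1, constant_values=False)
        # A hole pixel is on the current boundary if any 4-neighbor is known.
        touches = (filled[:-2, 1:-1] | filled[2:, 1:-1] |
                   filled[1:-1, :-2] | filled[1:-1, 2:])
        boundary = remaining & touches
        if not boundary.any():      # the hole covers the whole image
            break
        layers[boundary] = level
        remaining &= ~boundary
    return layers
```

An exemplar-based filler would visit the pixels of layer 1 first, then layer 2, and so on, so
that each new pixel always has some already-known neighbors to match against.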

Most hole filling algorithms borrow small pieces of the original image to fill in the holes. 
When a large database of source images is available, e.g., when images are taken from a 
photo sharing site or the internet, it is sometimes possible to copy a single contiguous image 
region to fill the hole. Hays and Efros (2007) present such a technique, which uses image 
context and boundary compatibility to select the source image, which is then blended with
the original (holey) image using graph cuts and Poisson blending. This technique is discussed
in more detail in Section 6.4.4 and Figure 6.40.

As with other areas of image processing, deep neural networks are used in all of the latest 
techniques (Yang, Lu et al. 2017; Yu, Lin et al. 2018; Liu, Reda et al. 2018; Zeng, Fu et al. 
2019; Yu, Lin et al. 2019; Chang, Liu et al. 2019; Nazeri, Ng et al. 2019; Ren, Yu et al. 2019; 
Shih, Su et al. 2020; Yi, Tang et al. 2020). Some of these papers have introduced interesting 
new extensions to neural network architectures, such as partial convolutions (Liu, Reda et 
al. 2018) and gated convolutions (Yu, Lin et al. 2019), the propagation of edge structures
(Nazeri, Ng et al. 2019; Ren, Yu et al. 2019), multi-resolution attention and residuals (Yi, 
Tang et al. 2020), and iterative confidence feedback (Zeng, Lin et al. 2020). Inpainting 
has also been applied to video sequences (e.g., Gao, Saraf et al. 2020). Results on recent 
challenges on image inpainting can be found in the AIM 2020 Workshop and Challenges on 
this topic (Ntavelis, Romero et al. 2020a). 
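As an illustration of one of these architectural extensions, the sketch below shows the core
of a partial convolution as we understand it from Liu, Reda et al. (2018): the filter response
is computed from valid pixels only, rescaled by the fraction of the window that is valid, and
the mask is updated so that holes shrink from layer to layer. It is a single-channel NumPy
sketch with no padding; the function name `partial_conv2d` is hypothetical and a real network
would use a multi-channel, learned version of this operation.

```python
import numpy as np

def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution (cross-correlation, no padding).

    image:  2D array of pixel values.
    mask:   2D array, 1 where the pixel is valid and 0 inside the hole.
    kernel: 2D filter weights.
    Returns the filtered output and the updated mask.
    """
    mask = np.asarray(mask, dtype=float)
    view = np.lib.stride_tricks.sliding_window_view
    x = view(image * mask, kernel.shape)     # windows of the masked input
    m = view(mask, kernel.shape)             # matching windows of the mask
    valid = m.sum(axis=(2, 3))               # number of valid pixels per window
    response = (x * kernel).sum(axis=(2, 3))
    # Rescale by (window size) / (number of valid pixels) to compensate for holes.
    scale = kernel.size / np.maximum(valid, 1.0)
    out = np.where(valid > 0, response * scale + bias, 0.0)
    # A location becomes valid as soon as its window saw at least one valid pixel.
    new_mask = (valid > 0).astype(float)
    return out, new_mask
```

With a box filter on a constant image, the rescaling makes the output constant even when some
input pixels are missing, which is the behavior that distinguishes this layer from an ordinary
convolution applied to a zero-filled hole.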


10.5.2 Application: Non-photorealistic rendering 


Two more applications of the exemplar-based texture synthesis ideas are texture transfer 
(Efros and Freeman 2001) and image analogies (Hertzmann, Jacobs et al. 2001), which are 
both examples of non-photorealistic rendering (Gooch and Gooch 2001). 

In addition to using a source texture image, texture transfer also takes a reference (or 
target) image, and tries to match certain characteristics of the target image with the newly 
synthesized image. For example, the new image being rendered in Figure 10.53c not only 
tries to satisfy the usual similarity constraints with the source texture in Figure 10.53b, but it 
also tries to match the luminance characteristics of the reference image. Efros and Freeman 
(2001) mention that blurred image intensities or local image orientation angles are alternative 
quantities that could be matched. 


Hertzmann, Jacobs et al. (2001) formulate the following problem: 


Given a pair of images A and A′ (the unfiltered and filtered source images,
respectively), along with some additional unfiltered target image B, synthesize a
new filtered target image B′ such that

$$A : A' :: B : B'.$$


Instead of having the user program a certain non-photorealistic rendering effect, it is sufficient 
to supply the system with examples of before and after images, and let the system synthesize 
the novel image using exemplar-based synthesis, as shown in Figure 10.54. 

The algorithm used to solve image analogies proceeds in a manner analogous to the texture
synthesis algorithms of Efros and Leung (1999) and Wei and Levoy (2000). Once Gaussian
pyramids have been computed for all of the source and reference images, the algorithm
looks for neighborhoods in the source filtered pyramids generated from A′ that are similar
to the partially constructed neighborhood in B′, while at the same time having similar
multi-resolution appearances at corresponding locations in A and B. As with texture transfer,
appearance characteristics can include not only (blurred) color or luminance values but
also orientations.

Figure 10.53 Texture transfer (Efros and Freeman 2001) © 2001 ACM: (a) reference (target)
image; (b) source texture; (c) image (partially) rendered using the texture.

Figure 10.54 Image analogies (Hertzmann, Jacobs et al. 2001) © 2001 ACM. Given an
example pair of a source image A and its rendered (filtered) version A′, generate the rendered
version B′ from another unfiltered source image B.

This general framework allows image analogies to be applied to a variety of rendering
tasks. In addition to exemplar-based non-photorealistic rendering, image analogies can be
used for traditional texture synthesis, super-resolution, and texture transfer (using the same
textured image for both A and A′). If only the filtered (rendered) image A′ is available, as
is the case with paintings, the missing reference image A can be hallucinated using a smart
(edge preserving) blur operator. Finally, it is possible to train a system to perform
texture-by-numbers by manually painting over a natural image with pseudocolors corresponding to
pixels’ semantic meanings, e.g., water, trees, and grass (Figure 10.55a-b). The resulting system
can then convert a novel sketch into a fully rendered synthetic photograph (Figure 10.55c-d).
In more recent work, Cheng, Vishwanathan, and Zhang (2008) add ideas from image quilting
(Efros and Freeman 2001) and MRF inference (Komodakis, Tziritas, and Paragios 2008) to
the basic image analogies algorithm, while Ramanarayanan and Bala (2007) recast this process
as energy minimization, which means it can be viewed as a conditional random field
(Section 4.3.1), and devise an efficient algorithm to find a good minimum.

Figure 10.55 Texture-by-numbers (Hertzmann, Jacobs et al. 2001) © 2001 ACM. Given a
textured image A′ and a hand-labeled (painted) version A, synthesize a new image B′ given
just the painted version B. (Panels, left to right: original A′, painted A, novel painted B,
novel textured B′.)

More traditional filtering and feature detection techniques can also be used for
non-photorealistic rendering.³⁰ For example, pen-and-ink illustration (Winkenbach and Salesin
1994) and painterly rendering techniques (Litwinowicz 1997) use local color, intensity, and
orientation estimates as an input to their procedural rendering algorithms. Techniques for
stylizing and simplifying photographs and video (DeCarlo and Santella 2002; Winnemöller,
Olsen, and Gooch 2006; Farbman, Fattal et al. 2008), as in Figure 10.56, use combinations of 
edge-preserving blurring (Section 3.3.1) and edge detection and enhancement (Section 7.2.3). 


10.5.3 Neural style transfer and semantic image synthesis 


With the advent of deep learning, image-guided exemplar-based texture synthesis has mostly
been replaced with statistics matching in deep networks (Gatys, Ecker, and Bethge 2015).
Figure 10.57 illustrates the basic idea used in neural style transfer networks. In the original
work of Gatys, Ecker, and Bethge (2016), a style image $y_s$ and a content image $y_c$ (see
Figure 10.58 for examples) are input to a loss network, which compares features derived
from the style and target images with those derived from the image $\hat{y}$ being synthesized.
These losses are normally a combination of perceptual losses, i.e., a style reconstruction
loss and a feature reconstruction loss (Figure 10.57). The gradients of these losses
are used to adjust the generated image $\hat{y}$ in an iterative fashion, which makes this process
quite slow.

³⁰ For a good selection of papers, see the Symposia on Non-Photorealistic Animation and Rendering (NPAR).

Figure 10.56 Non-photorealistic abstraction of photographs: (a) (DeCarlo and Santella
2002) © 2002 ACM and (b) (Farbman, Fattal et al. 2008) © 2008 ACM.

Figure 10.57 Network architecture for neural style transfer, which learns to transform
images in one particular style (Johnson, Alahi, and Fei-Fei 2016) © 2016 Springer. During
training, the content target image $y_c$ is fed into the image transformation network as an
input $x$, along with a style image $y_s$, and the network weights are updated so as to minimize
the perceptual losses, i.e., the style reconstruction loss $\ell_{style}$ and the feature
reconstruction loss $\ell_{feat}$. The earlier network by Gatys, Ecker, and Bethge (2015) did not
have an image transformation network, and instead used the losses to optimize the transformed
image $\hat{y}$.

To accelerate this, Johnson, Alahi, and Fei-Fei (2016) train a feedforward image
transformation network with a fixed style image and many different content targets, adjusting the
network weights so that the stylized image $\hat{y}$ resulting from a target $y_c$ matches the
desired statistics. When a new image $x$ is presented to be stylized, it is simply run through
the image transformation network. Figure 10.58a shows some comparisons between Gatys, Ecker, and
Bethge (2016) and Johnson, Alahi, and Fei-Fei (2016).
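The two perceptual losses can be written down compactly once the loss-network activations are
available. The sketch below assumes that the feature maps of one layer have already been
extracted as arrays of shape (C, H, W); the function names are hypothetical, the normalization
constants follow our reading of Johnson, Alahi, and Fei-Fei (2016), and a full system would
sum these terms over several layers with tunable weights.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlation statistics of one layer.

    features: (C, H, W) activations. The result is (C, C) and does not
    depend on where in the image each activation occurred.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_generated, feat_style):
    """Squared Frobenius distance between the two Gram matrices."""
    diff = gram_matrix(feat_generated) - gram_matrix(feat_style)
    return float((diff ** 2).sum())

def feature_loss(feat_generated, feat_content):
    """Mean squared difference between activations (keeps spatial layout)."""
    return float(((feat_generated - feat_content) ** 2).mean())
```

Because the Gram matrix sums over all spatial positions, the style loss ignores layout and
captures only texture-like statistics, while the feature loss penalizes changes in spatial
structure; this is why the two are combined.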

Perceptual losses have now become a standard component of image synthesis systems
(Dosovitskiy and Brox 2016), often as an additional component to the generative adversarial loss
(Section 5.5.4). They are also sometimes used as an alternative to older image quality metrics
such as SSIM (Zhang, Isola et al. 2018; Talebi and Milanfar 2018; Tariq, Tursun et al. 2020; 
Czolbe, Krause et al. 2020). 

The basic architecture in Johnson, Alahi, and Fei-Fei (2016) was extended by Ulyanov, 
Vedaldi, and Lempitsky (2017), who show that using instance normalization instead of batch 
normalization significantly improves the results. Dumoulin, Shlens, and Kudlur (2017) and 
Huang and Belongie (2017) further extended these ideas to train one network to mimic dif- 
ferent styles, using conditional instance normalization and adaptive instance normalization 
to select among the pre-trained styles (or in-between blends), as shown in Figure 10.58b. 
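The normalization layers mentioned above differ only in where the per-channel scale and shift
come from. The following NumPy sketch operates on the (C, H, W) feature map of a single image;
the function names and the `gamma`/`beta` tables are illustrative assumptions, not code from
the cited papers.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of one image over its spatial extent."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(x.var(axis=(1, 2), keepdims=True) + eps)
    return (x - mean) / std

def conditional_instance_norm(x, gamma, beta, style_id):
    """Scale and shift with parameters learned per style.

    gamma, beta: (num_styles, C) tables; `style_id` selects one row.
    """
    g = gamma[style_id][:, None, None]
    b = beta[style_id][:, None, None]
    return g * instance_norm(x) + b

def adaptive_instance_norm(content, style, eps=1e-5):
    """Scale and shift using the channel statistics of the style features."""
    mean = style.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(style.var(axis=(1, 2), keepdims=True) + eps)
    return std * instance_norm(content, eps) + mean
```

In the conditional variant, an in-between blend of two styles could plausibly be obtained by
interpolating the corresponding rows of `gamma` and `beta` before applying them.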

Neural style transfer continues to be an actively studied area, with related approaches 
working on more generalized image-to-image translation (Isola, Zhu et al. 2017) and seman- 
tic photo synthesis (Chen and Koltun 2017; Park, Liu et al. 2019; Bau, Strobelt et al. 2019; 
Ntavelis, Romero et al. 2020b) applications—see Tewari, Fried et al. (2020, Section 6.1) 
for a recent survey. Most of the newer architectures use generative adversarial networks 
(GANs) (Kotovenko, Sanakoyeu et al. 2019; Shaham, Dekel, and Michaeli 2019; Yang, Wang 
et al. 2019; Svoboda, Anoosheh et al. 2020; Wang, Li et al. 2020; Xia, Zhang et al. 2020; 
Härkönen, Hertzmann et al. 2020), which we discussed in Section 5.5.4. There’s also a recent
course on the more general topic of learning-based image synthesis (Zhu 2021).


10.6 Additional reading 


Good overviews of the first decade of computational photography can be found in the book by
Raskar and Tumblin (2010) and survey articles by Nayar (2006), Cohen and Szeliski (2006),
Levoy (2006), Debevec (2006), and Hayes (2008), as well as two special journal issues edited
by Bimber (2006) and Durand and Szeliski (2007). Notes from the courses on computational
photography mentioned at the beginning of this chapter are another great source for more
recent material and references.³¹

Figure 10.58 Two examples of neural style transfer: (a) the pre-trained network of Johnson,
Alahi, and Fei-Fei (2016) © 2016 Springer (labeled “Ours”) vs. (Gatys, Ecker, and
Bethge 2016) (labeled “[11]”); (b) a network that uses conditional instance normalization
to mimic different styles (top row) applied to various content (left column) © (Dumoulin,
Shlens, and Kudlur 2017).

The sub-field of high dynamic range imaging has its own book discussing research in this 
area (Reinhard, Heidrich et al. 2010), as well as some books describing related photographic 
techniques (Freeman 2008; Gulbins and Gulbins 2009). Algorithms for calibrating the radio- 
metric response function of a camera can be found in articles by Mann and Picard (1995), 
Debevec and Malik (1997), and Mitsunaga and Nayar (1999). 

The subject of tone mapping is treated extensively in (Reinhard, Heidrich et al. 2010). 
Representative papers from the large volume of literature on this topic include (Tumblin and 
Rushmeier 1993; Larson, Rushmeier, and Piatko 1997; Pattanaik, Ferwerda et al. 1998; Tum- 
blin and Turk 1999; Durand and Dorsey 2002; Fattal, Lischinski, and Werman 2002; Rein- 
hard, Stark et al. 2002; Lischinski, Farbman et al. 2006; Farbman, Fattal et al. 2008; Paris, 
Hasinoff, and Kautz 2011; Aubry, Paris et al. 2014). 

The literature on super-resolution is quite extensive (Chaudhuri 2001; Park, Park, and 
Kang 2003; Capel and Zisserman 2003; Capel 2004; van Ouwerkerk 2006). The term super- 
resolution usually describes techniques for aligning and merging multiple images to produce 
higher-resolution composites (Keren, Peleg, and Brada 1988; Irani and Peleg 1991; Cheese- 
man, Kanefsky et al. 1993; Mann and Picard 1994; Chiang and Boult 1996; Bascle, Blake, 
and Zisserman 1996; Capel and Zisserman 1998; Smelyanskiy, Cheeseman et al. 2000; Capel 
and Zisserman 2000; Pickup, Capel et al. 2009; Gulbins and Gulbins 2009; Hasinoff, Sharlet 
et al. 2016; Wronski, Garcia-Dorado et al. 2019). However, single-image super-resolution 
techniques have also been developed (Freeman, Jones, and Pasztor 2002; Baker and Kanade 
2002; Fattal 2007; Dong, Loy et al. 2016; Cai, Gu et al. 2019; Anwar, Khan, and Barnes 
2020). Such techniques are closely related to denoising (Zhang, Zuo et al. 2017; Brown 
2019; Liba, Murthy et al. 2019; Gu and Timofte 2019), deblurring and blind image deconvo- 
lution (Campisi and Egiazarian 2017; Zhang, Dai et al. 2019; Kupyn, Martyniuk et al. 2019), 
and demosaicing (Chatterjee, Joshi et al. 2011; Gharbi, Chaurasia et al. 2016; Abdelhamed,
Afifi et al. 2020). 

A good survey on image matting is given by Wang and Cohen (2009). Representative 
papers, which include extensive comparisons with previous work, include (Chuang, Curless 
et al. 2001; Wang and Cohen 2007a; Levin, Acha, and Lischinski 2008; Rhemann, Rother et 
al. 2008, 2009; Xu, Price et al. 2017). You can find pointers to recent papers and results on 
the http://alphamatting.com website created by Rhemann, Rother et al. (2009). Matting ideas 
can also be applied to manipulate shadows (Chuang, Goldman et al. 2003; Sun, Barron et al.
2019; Zhou, Hadap et al. 2019; Zhang, Barron et al. 2020; Wang, Curless, and Seitz 2020)
and videos (Chuang, Agarwala et al. 2002; Wang, Bhat et al. 2005; Erofeev, Gitman et al.
2015; Sengupta, Jayaram et al. 2020; Lin, Ryabtsev et al. 2021).

³¹ CMU 15-463, http://graphics.cs.cmu.edu/courses/15-463/2008_fall, Berkeley CS194-26/294-26,
https://inst.eecs.berkeley.edu/~cs194-26/fa20, MIT 6.815/6.865,
https://stellar.mit.edu/S/course/6/sp08/6.815/materials.html, Stanford CS 448A,
https://graphics.stanford.edu/courses/cs448a-08-spring, CMU 16-726,
https://learning-image-synthesis.github.io, and SIGGRAPH courses,
https://web.media.mit.edu/~raskar/photo.

The literature on texture synthesis and hole filling includes traditional approaches to tex- 
ture synthesis, which try to match image statistics between source and destination images 
(Heeger and Bergen 1995; De Bonet 1997; Portilla and Simoncelli 2000), as well as ap- 
proaches that search for matching neighborhoods or patches inside the source sample (Efros 
and Leung 1999; Wei and Levoy 2000; Efros and Freeman 2001; Wei, Lefebvre et al. 2009) 
or use neural networks (Gatys, Ecker, and Bethge 2015; Shaham, Dekel, and Michaeli 2019). 
In a similar vein, traditional approaches to hole filling involve the solution of local varia- 
tional (smooth continuation) problems (Bertalmio, Sapiro et al. 2000; Bertalmio, Vese et al. 
2003; Telea 2004). The next wave of techniques uses data-driven texture synthesis approaches
(Drori, Cohen-Or, and Yeshurun 2003; Kwatra, Schödl et al. 2003; Criminisi, Pérez, and
Toyama 2004; Sun, Yuan et al. 2004; Kwatra, Essa et al. 2005; Wilczkowiak, Brostow et al.
2005; Komodakis and Tziritas 2007; Wexler, Shechtman, and Irani 2007). The most recent
algorithms for image and video inpainting use deep neural networks (Yang, Lu et al. 2017;
Yu, Lin et al. 2018; Liu, Reda et al. 2018; Shih, Su et al. 2020; Yi, Tang et al. 2020; Gao,
Saraf et al. 2020; Ntavelis, Romero et al. 2020a). In addition to generating isolated patches of
texture or inpainting missing regions, related techniques can also be used to transfer the style
of an image or painting to another one (Efros and Freeman 2001; Hertzmann, Jacobs et al. 
2001; Gatys, Ecker, and Bethge 2016; Johnson, Alahi, and Fei-Fei 2016; Dumoulin, Shlens, 
and Kudlur 2017; Huang and Belongie 2017; Shaham, Dekel, and Michaeli 2019). 


10.7 Exercises 


Ex 10.1: Radiometric calibration. Implement one of the multi-exposure radiometric cali- 
bration algorithms described in Section 10.2 (Debevec and Malik 1997; Mitsunaga and Nayar 
1999; Reinhard, Heidrich et al. 2010). This calibration will be useful in a number of different 
applications, such as stitching images or stereo matching with different exposures and shape 
from shading. 


1. Take a series of bracketed images with your camera on a tripod. If your camera has
an automatic exposure bracketing (AEB) mode, taking three images may be sufficient
to calibrate most of your camera’s dynamic range, especially if your scene has a lot of
bright and dark regions. (Shooting outdoors or through a window on a sunny day is
best.)

2. If your images are not taken on a tripod, first perform a global alignment.

3. Estimate the radiometric response function using one of the techniques cited above.

4. Estimate the high dynamic range radiance image by selecting or blending pixels from
different exposures (Debevec and Malik 1997; Mitsunaga and Nayar 1999; Eden,
Uyttendaele, and Szeliski 2006).

5. Repeat your calibration experiments under different conditions, e.g., indoors under
incandescent light, to get a sense for the range of color balancing effects that your camera
imposes.

6. If your camera supports RAW and JPEG mode, calibrate both sets of images simultaneously
and to each other (the radiance at each pixel will correspond). See if you can
come up with a model for what your camera does, e.g., whether it treats color balance
as a diagonal or full 3 × 3 matrix multiply, whether it uses non-linearities in addition
to gamma, whether it sharpens the image while “developing” the JPEG image, etc.


7. Develop an interactive viewer to change the exposure of an image based on the average 
exposure of a region around the mouse. (One variant is to show the adjusted image 
inside a window around the mouse. Another is to adjust the complete image based on 
the mouse position.) 


8. Implement a tone mapping operator (Exercise 10.5) and use this to map your radiance 
image to a displayable gamut. 


Ex 10.2: Noise level function. Determine your camera’s noise level function either by
taking multiple shots or by analyzing smooth regions.


1. Set up your camera on a tripod looking at a calibration target or a static scene with a 
good variation in input levels and colors. (Check your camera’s histogram to ensure 
that all values are being sampled.) 


2. Take repeated images of the same scene (ideally with a remote shutter release) and
average them to compute the variance at each pixel. Discarding pixels near high gradients
(which are affected by camera motion), plot for each color channel the standard
deviation at each pixel as a function of its output value.

3. Fit a lower envelope to these measurements and use this as your noise level function.
How much variation do you see in the noise as a function of input level? How much of
this is significant, i.e., away from flat regions in your camera response function where
you do not want to be sampling anyway?

4. (Optional) Using the same images, develop a technique that segments the image into
near-constant regions (Liu, Szeliski et al. 2008). (This is easier if you are photographing
a calibration chart.) Compute the deviations for each region from a single image
and use them to estimate the NLF. How does this compare to the multi-image technique,
and how stable are your estimates from image to image?


Ex 10.3: Vignetting. Estimate the amount of vignetting in some of your lenses using one
of the following three techniques (or devise one of your choosing):

1. Take an image of a large uniform intensity region (a well-illuminated wall or blue sky,
but be careful of brightness gradients) and fit a radial polynomial curve to estimate the
vignetting.


2. Construct a center-weighted panorama and compare these pixel values to the input im- 
age values to estimate the vignetting function. Weight pixels in slowly varying regions 
more highly, as small misalignments will give large errors at high gradients. Option- 
ally estimate the radiometric response function as well (Litvinov and Schechner 2005; 
Goldman 2010). 


3. Analyze the radial gradients (especially in low-gradient regions) and fit the robust 
means of these gradients to the derivative of the vignetting function, as described by 
Zheng, Yu et al. (2008). 


For the parametric form of your vignetting function, you can either use a simple radial
function, e.g.,

$$f(r) = 1 + \alpha_1 r + \alpha_2 r^2 + \cdots \qquad (10.41)$$


or one of the specialized equations developed by Kang and Weiss (2000) and Zheng, Lin, and 
Kang (2006). 

In all of these cases, be sure that you are using linearized intensity measurements, by 
using either RAW images or images linearized through a radiometric response function, or at 
least images where the gamma curve has been removed. 

(Optional) What happens if you forget to undo the gamma before fitting a (multiplicative) 
vignetting function? 
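As a possible starting point for the first technique, the sketch below fits a simple radial
polynomial to a linearized image of a uniform region by linear least squares. It assumes that
the optical center coincides with the image center and that the radius is normalized to 1 at
the image corner; both are simplifying assumptions, and the function names are hypothetical.

```python
import numpy as np

def normalized_radius(shape):
    """Distance of every pixel from the image center, scaled to 1 at the corners."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - (h - 1) / 2.0, xx - (w - 1) / 2.0)
    return r / r.max()

def fit_radial_vignetting(image, degree=3):
    """Fit I(r) = I0 * (1 + a_1 r + ... + a_degree r^degree) to a uniform image.

    image: 2D array of linearized intensities.
    Returns the coefficients (a_1, ..., a_degree).
    """
    r = normalized_radius(image.shape).ravel()
    # The model is linear in c_0 = I0 and c_k = I0 * a_k, so solve for those.
    A = np.stack([r ** k for k in range(degree + 1)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, image.ravel(), rcond=None)
    return coeffs[1:] / coeffs[0]
```

Dividing an image by the fitted polynomial evaluated at `normalized_radius(image.shape)` would
then give a first-order vignetting correction; a robust fit may be preferable when the wall or
sky is not perfectly uniform.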


Ex 10.4: Optical blur (PSF) estimation. Compute the optical PSF either using a known 
target (Figure 10.7) or by detecting and fitting step edges (Section 10.1.4) (Joshi, Szeliski, 
and Kriegman 2008; Cho, Paris et al. 2011). 


1. Detect strong edges to sub-pixel precision.

2. Fit a local profile to each oriented edge and fill these pixels into an ideal target image,
either at image resolution or at a higher resolution (Figure 10.9c-d).

3. Use least squares (10.1) at valid pixels to estimate the PSF kernel K, either globally or
in locally overlapping sub-regions of the image.

4. Visualize the recovered PSFs and use them to remove chromatic aberration or deblur
the image.


Ex 10.5: Tone mapping. Implement one of the tone mapping algorithms discussed in
Section 10.2.1 (Durand and Dorsey 2002; Fattal, Lischinski, and Werman 2002; Reinhard, Stark
et al. 2002; Lischinski, Farbman et al. 2006) or any of the numerous additional algorithms
discussed by Reinhard, Heidrich et al. (2010) and
https://stellar.mit.edu/S/course/6/sp08/6.815/materials.html.


(Optional) Compare your algorithm to local histogram equalization (Section 3.1.4). 
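One of the simplest operators to start from is the global operator of Reinhard, Stark et al.
(2002). The sketch below is written from our recollection of that paper rather than from any
reference code: it scales the luminance by a key value relative to the log-average luminance
and then compresses it with a curve that maps a chosen white level to 1. The function name and
the default parameter values are assumptions.

```python
import numpy as np

def reinhard_global(luminance, key=0.18, white=None, delta=1e-6):
    """Global tone mapping of a linear HDR luminance image to [0, 1].

    luminance: 2D array of non-negative linear radiance (luminance) values.
    key:       target middle-gray value of the scaled image.
    white:     scaled luminance that maps to 1; defaults to the image maximum.
    """
    # Log-average luminance; delta avoids log(0) for black pixels.
    log_avg = np.exp(np.mean(np.log(delta + luminance)))
    scaled = key / log_avg * luminance
    if white is None:
        white = scaled.max()
    # Compressive curve with controlled burn-out above `white`.
    return scaled * (1.0 + scaled / white ** 2) / (1.0 + scaled)
```

For color images, one option is to compute the luminance, tone map it, and rescale each color
channel by the ratio of output to input luminance before applying the display gamma.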


Ex 10.6: Flash enhancement. Develop an algorithm to combine flash and non-flash
photographs to best effect. You can use ideas from Eisemann and Durand (2004) and Petschnigg,
Agrawala et al. (2004) or anything else you think might work well.


Ex 10.7: Super-resolution. Implement one or more super-resolution algorithms and compare
their performance.

1. Take a set of photographs of the same scene using a hand-held camera (to ensure that
there is some jitter between the photographs).

2. Determine the PSF for the images you are trying to super-resolve using one of the
techniques in Exercise 10.4.

3. Alternatively, simulate a collection of lower-resolution images by taking a high-quality
photograph (avoid those with compression artifacts) and applying your own prefilter
kernel and downsampling.

4. Estimate the relative motion between the images using a parametric translation and
rotation motion estimation algorithm (Sections 8.1.3 or 9.2).

5. Implement a basic least squares super-resolution algorithm by minimizing the difference
between the observed and downsampled images (10.26-10.27).

6. Add in a gradient image prior, either as another least squares term or as a robust term
that can be minimized using iteratively reweighted least squares (Appendix A.3).

7. (Optional) Implement one of the example-based super-resolution techniques, where
matching against a set of exemplar images is used either to infer higher-frequency
information to be added to the reconstruction (Freeman, Jones, and Pasztor 2002)
or higher-frequency gradients to be matched in the super-resolved image (Baker and
Kanade 2002).

8. (Optional) Use local edge statistic information to improve the quality of the
super-resolved image (Fattal 2007).

9. (Optional) Try some of the newest DNN-based super-resolution algorithms.


Ex 10.8: Image matting. Develop an algorithm for pulling a foreground matte from natural
images, as described in Section 10.4.

1. Make sure that the images you are taking are linearized (Exercise 10.1 and Section 10.1)
and that your camera exposure is fixed (full manual mode), at least when taking multiple
shots of the same scene.

2. To acquire ground truth data, place your object in front of a computer monitor and
display a variety of solid background colors as well as some natural imagery.

3. Remove your object and re-display the same images to acquire known background
colors.

4. Use triangulation matting (Smith and Blinn 1996) to estimate the ground truth opacities
α and pre-multiplied foreground colors αF for your objects.

5. Implement one or more of the natural image matting algorithms described in Section 10.4
and compare your results to the ground truth values you computed. Alternatively,
use the matting test images published on http://alphamatting.com.

6. (Optional) Run your algorithms on other images taken with the same calibrated camera
(or other images you find interesting).
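As a starting point for the ground truth step in Exercise 10.8, triangulation matting for a single pixel might be sketched as follows. This is only a sketch under stated assumptions: it assumes the standard compositing equation C = αF + (1 − α)B holds for linearized colors, that the same pixel has been observed against two different known backgrounds, and the function name `triangulation_matte` is a hypothetical helper, not something defined in this book.

```python
def triangulation_matte(c1, c2, b1, b2):
    """Estimate alpha and the pre-multiplied foreground alpha*F for one pixel.

    c1, c2: observed composite colors against backgrounds b1, b2 (RGB tuples).
    Assumes C = alpha*F + (1 - alpha)*B with linearized color values.
    """
    dc = [p - q for p, q in zip(c1, c2)]   # C1 - C2 = (1 - alpha) (B1 - B2)
    db = [p - q for p, q in zip(b1, b2)]
    denom = sum(d * d for d in db)
    if denom == 0:
        raise ValueError("the two backgrounds must differ at this pixel")
    # Least squares estimate of (1 - alpha) over the three color channels.
    alpha = 1.0 - sum(x * y for x, y in zip(dc, db)) / denom
    alpha = min(1.0, max(0.0, alpha))      # clamp against noise
    premultiplied = [c - (1.0 - alpha) * b for c, b in zip(c1, b1)]
    return alpha, premultiplied


# Synthetic pixel: alpha = 0.6, alpha*F = (0.48, 0.12, 0.06).
alpha, af = triangulation_matte((0.48, 0.12, 0.46), (0.48, 0.52, 0.06),
                                (0.0, 0.0, 1.0), (0.0, 1.0, 0.0))
```

With more than two backgrounds, the same least squares idea extends by summing the numerator and denominator over all background pairs.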


Ex 10.9: Smoke and shadow matting. Extract smoke or shadow mattes from one scene
and insert them into another (Chuang, Agarwala et al. 2002; Chuang, Goldman et al. 2003).

1. Take a still or video sequence of images with and without some intermittent smoke and
shadows. (Remember to linearize your images before proceeding with any computations.)

2. For each pixel, fit a line to the observed color values.

3. If performing smoke matting, robustly compute the intersection of these lines to obtain
the smoke color estimate. Then, estimate the background color as the other extremum
(unless you have already taken a smoke-free background image).

If performing shadow matting, compute robust shadow (minimum) and lit (maximum)
values for each pixel.

4. Extract the smoke or shadow mattes from each frame as the fraction between these two
values (background and smoke or shadowed and lit).

5. Scan a new (destination) scene or modify the original background with an image editor.


6. Re-insert the smoke or shadow matte, along with any other foreground objects you may 
have extracted. 


7. (Optional) Using a series of cast stick shadows, estimate the deformation field for the 
destination scene to correctly warp (drape) the shadows across the new geometry. (This 
is related to the shadow scanning technique developed by Bouguet and Perona (1999) 
and implemented in Exercise 13.2.) 


8. (Optional) Chuang, Goldman et al. (2003) only demonstrated their technique for planar 
source geometries. Can you extend their technique to capture shadows acquired from 
an irregular source geometry? 


9. (Optional) Can you change the direction of the shadow, i.e., simulate the effect of 


changing the light source direction? 


10. (Optional) Re-implement the facial shadow removal algorithm of Zhang, Barron et al. 
(2020) and try applying it to other domains. 


Ex 10.10: Texture synthesis. Implement one of the texture synthesis or hole filling algo- 


rithms presented in Section 10.5. Here is one possible procedure: 


1. Implement the basic Efros and Leung (1999) algorithm, i.e., starting from the outside 
(for hole filling) or in raster order (for texture synthesis), search for a similar neighbor- 
hood in the source texture image, and copy that pixel. 


2. Add in the Wei and Levoy (2000) extension of generating the pixels in a coarse-to-fine 
fashion, i.e., generate a lower-resolution synthetic texture (or filled image), and use this 
as a guide for matching regions in the finer resolution version. 


3. Add in the Criminisi, Pérez, and Toyama (2004) idea of prioritizing pixels to be filled
by some function of the local structure (gradient or orientation strength).

4. Extend any of the above algorithms by selecting sub-blocks in the source texture and
using optimization to determine the seam between the new block and the existing image
that it overlaps (Efros and Freeman 2001).


5. (Optional) Implement one of the isophote (smooth continuation) inpainting algorithms 
(Bertalmio, Sapiro et al. 2000; Telea 2004). 


6. (Optional) Add the ability to supply a target (reference) image (Efros and Freeman 
2001) or to provide sample filtered or unfiltered (reference and rendered) images (Hertz- 
mann, Jacobs et al. 2001), see Section 10.5.2. 


7. (Optional) Try some of the newer DNN-based inpainting algorithms described at the 
end of Section 10.5.1. 


Ex 10.11: Colorization. Implement the Levin, Lischinski, and Weiss (2004) colorization 
algorithm that is sketched out in Section 4.2.4 and Figure 4.10. If you prefer, you can im- 
plement this as a neural network (Zhang, Zhu et al. 2017). Find some historic monochrome 
photographs and some modern color ones. Write an interactive tool that lets you “pick” col- 
ors from a modern photo and paint over the old one. Tune the algorithm parameters to give 
you good results. Are you pleased with the results? Can you think of ways to make them 
look more “antique”, e.g., with softer (less saturated and edgy) colors? 

(Alternative) Implement or test out one of the newer “automatic colorization” algorithms 
such as Zhang, Isola, and Efros (2016) or Vondrick, Shrivastava et al. (2018).


Ex 10.12: Style transfer. Try some of the non-photorealistic rendering or style transfer al- 
gorithms from Sections 10.5.2-10.5.3 on your own images. Can you come up with surprising 
results? How about failure cases? 
Figure 11.1 Structure from motion examples: (a) a two-dimensional calibration target
(Zhang 2000) © 2000 IEEE; (b) single view metrology (Criminisi, Reid, and Zisserman 2000)
© 2000 Springer; (c–d) line matching (Schmid and Zisserman 1997) © 1997 IEEE; (e–g) 3D
reconstructions of Trafalgar Square, Great Wall of China, and Prague Old Town Square
(Snavely, Seitz, and Szeliski 2006) © 2006 ACM; (h) smartphone augmented reality showing
real-time depth occlusion effects (Valentin, Kowdle et al. 2018) © 2018 ACM.

The reconstruction of 3D models from images has been one of the central topics in computer
vision since its inception (Figure 1.7). In fact, it was then believed that the construction of
3D models was a prerequisite for scene understanding and recognition (Marr 1982), although
work in the last few decades has proven otherwise. However, 3D modeling has also proven
to be immensely useful in applications such as virtual tourism (Section 11.4.6), autonomous
navigation (Section 11.5.1), and augmented reality (Section 11.5.2).


In the last three chapters, we focused on techniques for establishing correspondences 
between 2D images and using these in a variety of applications such as image stitching, video 
enhancement, and computational photography. In this chapter, we turn to the topic of using 
such correspondences to build sparse 3D models of a scene and to re-localize cameras with 
respect to such models. While this process often involves simultaneously estimating both 3D
geometry (structure) and camera pose (motion), it is commonly known (for historical reasons)
as structure from motion (Ullman 1979).


The topics of projective geometry and structure from motion are extremely rich and some 
excellent textbooks and surveys have been written on them (Faugeras and Luong 2001; Hart- 
ley and Zisserman 2004; Moons, Van Gool, and Vergauwen 2010; Ma, Soatto et al. 2012). 
This chapter skips over a lot of the richer material available in these books, such as the trifocal 
tensor and algebraic techniques for full self-calibration, and concentrates instead on the ba- 
sics that we have found useful in large-scale, image-based reconstruction problems (Snavely, 
Seitz, and Szeliski 2006). 


We begin this chapter in Section 11.1 with a review of commonly used techniques for 
calibrating the camera intrinsics, e.g., the focal length and radial distortion parameters we 
introduced in Sections 2.1.4–2.1.5. Next, we discuss how to estimate the extrinsic pose of a
camera from 3D to 2D point correspondences (Section 11.2) as well as how to triangulate a 
set of 2D correspondences to estimate a point’s 3D location. We then look at the two-frame 
structure from motion problem (Section 11.3), which involves the determination of the epipo- 
lar geometry between two cameras and which can also be used to recover certain information 
about the camera intrinsics using self-calibration (Section 11.3.4). Section 11.4.1 looks at 
factorization approaches to simultaneously estimating structure and motion from large num- 
bers of point tracks using orthographic approximations to the projection model. We then 
develop a more general and useful approach to structure from motion, namely the simultane- 
ous bundle adjustment of all the camera and 3D structure parameters (Section 11.4.2). We 
also look at special cases that arise when there are higher-level structures, such as lines and 
planes, in the scene (Section 11.4.8). In the last part of this chapter (Section 11.5), we look
at real-time systems for simultaneous localization and mapping (SLAM), which reconstruct
a 3D world model while moving through an environment, and can be applied to both visual
navigation and augmented reality.


11.1 Geometric intrinsic calibration 


As we discuss in the next section (Equations (11.14–11.15)), the computation of the internal
(intrinsic) camera calibration parameters can occur simultaneously with the estimation of the 
(extrinsic) pose of the camera with respect to a known calibration target. This, indeed, is the 
“classic” approach to camera calibration used in both the photogrammetry (Slama 1980) and 
the computer vision (Tsai 1987) communities. In this section, we look at simpler alternative 
formulations that may not involve the full solution of a non-linear regression problem, the use 
of alternative calibration targets, and the estimation of the non-linear part of camera optics 
such as radial distortion. In some applications, you can use the EXIF tags associated with 
a JPEG image to obtain a rough estimate of a camera’s focal length and hence to initialize 
iterative estimation algorithms; but this technique should be used with caution as the results
are often inaccurate.


Calibration patterns 


The use of a calibration pattern or set of markers is one of the more reliable ways to estimate 
a camera’s intrinsic parameters. In photogrammetry, it is common to set up a camera in a 
large field looking at distant calibration targets whose exact location has been precomputed 
using surveying equipment (Slama 1980; Atkinson 1996; Kraus 1997). In this case, the trans- 
lational component of the pose becomes irrelevant and only the camera rotation and intrinsic 
parameters need to be recovered. 

If a smaller calibration rig needs to be used, e.g., for indoor robotics applications or for 
mobile robots that carry their own calibration target, it is best if the calibration object can span 
as much of the workspace as possible (Figure 11.2a), as planar targets often fail to accurately 
predict the components of the pose that lie far away from the plane. A good way to determine 
if the calibration has been successfully performed is to estimate the covariance in the param- 
eters (Section 8.1.4) and then project 3D points from various points in the workspace into the 
image in order to estimate their 2D positional uncertainty. 

If no calibration pattern is available, it is also possible to perform calibration simulta- 
neously with structure and pose recovery (Sections 11.1.3 and 11.4.2), which is known as 
self-calibration (Faugeras, Luong, and Maybank 1992; Pollefeys, Koch, and Van Gool 1999; 
Hartley and Zisserman 2004; Moons, Van Gool, and Vergauwen 2010). However, such an
approach requires a large amount of imagery to be accurate.

Figure 11.2 Calibration patterns: (a) a three-dimensional target (Quan and Lan 1999)
© 1999 IEEE; (b) a two-dimensional target (Zhang 2000) © 2000 IEEE. Note that radial
distortion needs to be removed from such images before the feature points can be used for
calibration.


Planar calibration patterns 


When a finite workspace is being used and accurate machining and motion control platforms 
are available, a good way to perform calibration is to move a planar calibration target through 
the workspace volume and use the known 3D point locations for calibration. This approach 
is sometimes called the N-planes calibration approach (Gremban, Thorpe, and Kanade 1988; 
Champleboux, Lavallée et al. 1992b; Grossberg and Nayar 2001) and has the advantage that 
each camera pixel can be mapped to a unique 3D ray in space, which takes care of both linear 
effects modeled by the calibration matrix K and non-linear effects such as radial distortion 
(Section 11.1.4). 

A less cumbersome but also less accurate calibration can be obtained by waving a planar 
calibration pattern in front of a camera (Figure 11.2b). In this case, the pattern’s pose has to 
be recovered in conjunction with the intrinsics. In this technique, each input image is used
to compute a separate homography (8.19–8.23) $\mathbf{H}$ mapping the plane’s calibration points
$(X_i, Y_i, 1)$ into image coordinates $(x_i, y_i)$,

$$
\mathbf{x}_i = \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}
\sim \mathbf{K} \begin{bmatrix} \mathbf{r}_0 & \mathbf{r}_1 & \mathbf{t} \end{bmatrix}
\begin{bmatrix} X_i \\ Y_i \\ 1 \end{bmatrix}
\sim \mathbf{H} \mathbf{p}_i, \tag{11.1}
$$

where the $\mathbf{r}_i$ are the first two columns of $\mathbf{R}$ and $\sim$ indicates equality up to scale. From
these, Zhang (2000) shows how to form linear constraints on the nine entries in the
$\mathbf{B} = \mathbf{K}^{-T}\mathbf{K}^{-1}$ matrix, from which the calibration matrix $\mathbf{K}$ can be recovered using a matrix
square root and inversion. The matrix $\mathbf{B}$ is known as the image of the absolute conic (IAC)
in projective geometry and is commonly used for camera calibration (Hartley and Zisserman
2004, Section 8.5). If only the focal length is being recovered, the even simpler approach of
using vanishing points described below can be used instead.

Figure 11.3 Calibration from vanishing points: (a) any pair of finite vanishing points
$(\hat{\mathbf{x}}_i, \hat{\mathbf{x}}_j)$ can be used to estimate the focal length; (b) the orthocenter of the vanishing point
triangle gives the image center of the image $\mathbf{c}$.


11.1.1 Vanishing points 


A common case for calibration that occurs often in practice is when the camera is looking at a 
manufactured or architectural scene with long extended rectangular patterns such as boxes or 
building walls. In this case, we can intersect the 2D lines corresponding to 3D parallel lines 
to compute their vanishing points, as described in Section 7.4.3, and use these to determine 
the intrinsic and extrinsic calibration parameters (Caprile and Torre 1990; Becker and Bove 
1995; Liebowitz and Zisserman 1998; Cipolla, Drummond, and Robertson 1999; Antone and 
Teller 2002; Criminisi, Reid, and Zisserman 2000; Hartley and Zisserman 2004; Pflugfelder 
2008). 

Let us assume that we have detected two or more orthogonal vanishing points, all of which
are finite, i.e., they are not obtained from lines that appear to be parallel in the image plane
(Figure 11.3a). Let us also assume a simplified form for the calibration matrix K where only
the focal length is unknown (2.59). It is often safe for rough 3D modeling to assume that the
optical center is at the center of the image, that the aspect ratio is 1, and that there is no skew.
In this case, the projection equation for the vanishing points can be written as

$$
\hat{\mathbf{x}}_i = \begin{bmatrix} x_i - c_x \\ y_i - c_y \\ f \end{bmatrix}
\sim \mathbf{R} \mathbf{p}_i = \mathbf{r}_i, \tag{11.2}
$$

where $\mathbf{p}_i$ corresponds to one of the cardinal directions (1, 0, 0), (0, 1, 0), or (0, 0, 1), and $\mathbf{r}_i$
is the $i$th column of the rotation matrix R.

Figure 11.4 Single view metrology (Criminisi, Reid, and Zisserman 2000) © 2000
Springer: (a) input image showing the three coordinate axes computed from the two
horizontal vanishing points (which can be determined from the sidings on the shed); (b) a new
view of the 3D reconstruction.

From the orthogonality between columns of the rotation matrix, we have

$$
\mathbf{r}_i \cdot \mathbf{r}_j \sim (x_i - c_x)(x_j - c_x) + (y_i - c_y)(y_j - c_y) + f^2 = 0, \quad i \neq j, \tag{11.3}
$$

from which we can obtain an estimate for $f^2$. Note that the accuracy of this estimate increases
as the vanishing points move closer to the center of the image. In other words, it is best to tilt 
the calibration pattern a decent amount around the 45° axis, as in Figure 11.3a. Once the focal 
length f has been determined, the individual columns of R can be estimated by normalizing 
the left-hand side of (11.2) and taking cross products. Alternatively, the orthogonal Procrustes 
algorithm (8.32) can be used. 
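The estimate obtained from (11.3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `focal_from_vanishing_points` and `rotation_column` are hypothetical, the image center is assumed to be known, and the two vanishing points are assumed to be finite and to come from orthogonal 3D directions.

```python
import math


def focal_from_vanishing_points(v_i, v_j, center):
    """Solve (11.3) for f given two finite, orthogonal vanishing points."""
    cx, cy = center
    f_squared = -((v_i[0] - cx) * (v_j[0] - cx) + (v_i[1] - cy) * (v_j[1] - cy))
    if f_squared <= 0:
        raise ValueError("vanishing points are inconsistent with orthogonality")
    return math.sqrt(f_squared)


def rotation_column(v, center, f):
    """Normalize (x - cx, y - cy, f), the left-hand side of (11.2).

    The result is one column of R, determined only up to sign.
    """
    d = (v[0] - center[0], v[1] - center[1], f)
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)


# Synthetic example: directions (1, 0, 1) and (-1, 1, 1) seen by a camera
# with f = 800 and image center (320, 240).
center = (320.0, 240.0)
v0, v1 = (1120.0, 240.0), (-480.0, 1040.0)
f = focal_from_vanishing_points(v0, v1, center)
r0 = rotation_column(v0, center, f)
r1 = rotation_column(v1, center, f)
```

With noisy vanishing points, the quantity under the square root can become negative, which is why the sketch raises an error instead of returning a value.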

If all three vanishing points are visible and finite in the same image, it is also possible 
to estimate the image center as the orthocenter of the triangle formed by the three vanishing 
points (Caprile and Torre 1990; Hartley and Zisserman 2004, Section 8.6) (Figure 11.3b). 
In practice, however, it is more accurate to re-estimate any unknown intrinsic calibration 
parameters using non-linear least squares (11.14). 
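The orthocenter construction can be sketched as a small 2 × 2 linear solve. The helper name `orthocenter` is hypothetical, and the sketch assumes three finite, non-collinear vanishing points given in pixel coordinates; it intersects the altitude through the first point with the altitude through the second.

```python
def orthocenter(a, b, c):
    """Return the orthocenter of the triangle with 2D vertices a, b, c."""
    # The altitude through a is perpendicular to (b - c):  u . h = u . a
    # The altitude through b is perpendicular to (a - c):  v . h = v . b
    u = (b[0] - c[0], b[1] - c[1])
    v = (a[0] - c[0], a[1] - c[1])
    rhs_u = u[0] * a[0] + u[1] * a[1]
    rhs_v = v[0] * b[0] + v[1] * b[1]
    det = u[0] * v[1] - u[1] * v[0]
    if abs(det) < 1e-12:
        raise ValueError("vanishing points are (nearly) collinear")
    # Cramer's rule for the 2x2 system.
    x = (rhs_u * v[1] - u[1] * rhs_v) / det
    y = (u[0] * rhs_v - v[0] * rhs_u) / det
    return (x, y)


# Vanishing points of the orthogonal directions (1, 0, 1), (-1, 1, 1), and
# (-1, -2, 1) for a camera with f = 800 and image center (320, 240).
image_center = orthocenter((1120.0, 240.0), (-480.0, 1040.0), (-480.0, -1360.0))
```

As the text notes, such a closed-form estimate is best treated as an initial guess for a subsequent non-linear refinement.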


11.1.2 Application: Single view metrology 


A fun application of vanishing point estimation and camera calibration is the single view 
metrology system developed by Criminisi, Reid, and Zisserman (2000). Their system allows 
people to interactively measure heights and other dimensions as well as to build piecewise- 
planar 3D models, as shown in Figure 11.4. 
The first step in their system is to identify two orthogonal vanishing points on the ground
plane and the vanishing point for the vertical direction, which can be done by drawing some 
parallel sets of lines in the image. Alternatively, automated techniques such as those discussed 
in Section 7.4.3 or by Schaffalitzky and Zisserman (2000) could be used. The user then 
marks a few dimensions in the image, such as the height of a reference object, and the system 
can automatically compute the height of another object. Walls and other planar impostors 
(geometry) can also be sketched and reconstructed. 

In the formulation originally developed by Criminisi, Reid, and Zisserman (2000), the 
system produces an affine reconstruction, i.e., one that is only known up to a set of indepen- 
dent scaling factors along each axis. A potentially more useful system can be constructed by 
assuming that the camera is calibrated up to an unknown focal length, which can be recovered 
from orthogonal (finite) vanishing directions, as we have just described in Section 11.1.1. 
Once this is done, the user can indicate an origin on the ground plane and another point a 
known distance away. From this, points on the ground plane can be directly projected into 
3D, and points above the ground plane, when paired with their ground plane projections, can 
also be recovered. A fully metric reconstruction of the scene then becomes possible. 

Exercise 11.4 has you implement such a system and then use it to model some simple
3D scenes. Section 13.6.1 describes other, potentially multi-view, approaches to architectural
reconstruction, including an interactive piecewise-planar modeling system that uses vanishing
points to establish 3D line directions and plane normals (Sinha, Steedly et al. 2008).


11.1.3 Rotational motion 


When no calibration targets or known structures are available but you can rotate the camera 
around its front nodal point (or, equivalently, work in a large open environment where all ob- 
jects are distant), the camera can be calibrated from a set of overlapping images by assuming 
that it is undergoing pure rotational motion, as shown in Figure 11.5 (Stein 1995; Hartley 
1997b; Hartley, Hayman et al. 2000; de Agapito, Hayman, and Reid 2001; Kang and Weiss 
1999; Shum and Szeliski 2000; Frahm and Koch 2003). When a full 360° motion is used 
to perform this calibration, a very accurate estimate of the focal length f can be obtained, 
as the accuracy in this estimate is proportional to the total number of pixels in the resulting 
cylindrical panorama (Section 8.2.6) (Stein 1995; Shum and Szeliski 2000). 

Figure 11.5 Four images taken with a hand-held camera registered using a 3D rotation
motion model, which can be used to estimate the focal length of the camera (Szeliski and
Shum 1997) © 2000 ACM.

To use this technique, we first compute the homographies $\mathbf{H}_{ij}$ between all overlapping
pairs of images, as explained in Equations (8.19–8.23). Then, we use the observation, first
made in Equation (2.72) and explored in more detail in Equation (8.38), that each homography
is related to the inter-camera rotation $\mathbf{R}_{ij}$ through the (unknown) calibration matrices $\mathbf{K}_i$
and $\mathbf{K}_j$,

$$
\mathbf{H}_{ij} = \mathbf{K}_i \mathbf{R}_i \mathbf{R}_j^{-1} \mathbf{K}_j^{-1} = \mathbf{K}_i \mathbf{R}_{ij} \mathbf{K}_j^{-1}. \tag{11.4}
$$

The simplest way to obtain the calibration is to use the simplified form of the calibration
matrix (2.59), where we assume that the pixels are square and the image center lies at the
geometric center of the 2D pixel array, i.e., $\mathbf{K}_k = \mathrm{diag}(f_k, f_k, 1)$. We subtract half the width
and height from the original pixel coordinates so that the pixel $(x, y) = (0, 0)$ lies at the center
of the image. We can then rewrite Equation (11.4) as

$$
\mathbf{R}_{10} \sim \mathbf{K}_1^{-1} \mathbf{H}_{10} \mathbf{K}_0 \sim
\begin{bmatrix}
h_{00} & h_{01} & f_0^{-1} h_{02} \\
h_{10} & h_{11} & f_0^{-1} h_{12} \\
f_1 h_{20} & f_1 h_{21} & f_0^{-1} f_1 h_{22}
\end{bmatrix}, \tag{11.5}
$$

where $h_{ij}$ are the elements of $\mathbf{H}_{10}$.

Using the orthonormality properties of the rotation matrix $\mathbf{R}_{10}$ and the fact that the
right-hand side of (11.5) is known only up to a scale, we obtain

$$
h_{00}^2 + h_{01}^2 + f_0^{-2} h_{02}^2 = h_{10}^2 + h_{11}^2 + f_0^{-2} h_{12}^2 \tag{11.6}
$$

and

$$
h_{00} h_{10} + h_{01} h_{11} + f_0^{-2} h_{02} h_{12} = 0. \tag{11.7}
$$

From this, we can compute estimates for $f_0$ of

$$
f_0^2 = \frac{h_{12}^2 - h_{02}^2}{h_{00}^2 + h_{01}^2 - h_{10}^2 - h_{11}^2}
\quad \text{if} \quad h_{00}^2 + h_{01}^2 \neq h_{10}^2 + h_{11}^2 \tag{11.8}
$$

or

$$
f_0^2 = -\frac{h_{02} h_{12}}{h_{00} h_{10} + h_{01} h_{11}}
\quad \text{if} \quad h_{00} h_{10} \neq -h_{01} h_{11}. \tag{11.9}
$$

If neither of these conditions holds, we can also take the dot products between the first (or
second) row and the third one. Similar results can be obtained for $f_1$ as well, by analyzing the
columns of $\mathbf{H}_{10}$. If the focal length is the same for both images, we can take the geometric
mean of $f_0$ and $f_1$ as the estimated focal length $f = \sqrt{f_1 f_0}$. When multiple estimates of
$f$ are available, e.g., from different homographies, the median value can be used as the final
estimate. A more general (upper-triangular) estimate of K can be obtained in the case of a
fixed-parameter camera $\mathbf{K}_i = \mathbf{K}$ using the technique of Hartley (1997b). Extensions to the
cases of temporally varying calibration parameters and non-stationary cameras are discussed
by Hartley, Hayman et al. (2000) and de Agapito, Hayman, and Reid (2001).
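The two closed-form estimates (11.8) and (11.9) might be coded as in the sketch below. The function names are hypothetical, the homography is assumed to be expressed in centered pixel coordinates as described above, and the degeneracy threshold `eps` is an arbitrary choice that would need tuning for real (noisy, arbitrarily scaled) homographies. The second function only builds synthetic test data.

```python
import math


def focal_estimates_from_homography(h, eps=1e-12):
    """Return the estimates of f0 given by (11.8) and (11.9), when defined.

    h is a 3x3 nested list holding H_10 in centered pixel coordinates.
    """
    estimates = []
    # (11.8): rows 0 and 1 of the matrix in (11.5) have equal norms.
    num = h[1][2] ** 2 - h[0][2] ** 2
    den = h[0][0] ** 2 + h[0][1] ** 2 - h[1][0] ** 2 - h[1][1] ** 2
    if abs(den) > eps and num / den > 0:
        estimates.append(math.sqrt(num / den))
    # (11.9): rows 0 and 1 of the matrix in (11.5) are orthogonal.
    num = -h[0][2] * h[1][2]
    den = h[0][0] * h[1][0] + h[0][1] * h[1][1]
    if abs(den) > eps and num / den > 0:
        estimates.append(math.sqrt(num / den))
    return estimates


def synthetic_rotation_homography(f0, f1, tilt, pan, scale=1.0):
    """Build scale * K_1 R K_0^{-1} with K_k = diag(f_k, f_k, 1) (test data)."""
    ca, sa = math.cos(tilt), math.sin(tilt)
    cb, sb = math.cos(pan), math.sin(pan)
    # R = R_x(tilt) R_y(pan)
    r = [[cb, 0.0, sb],
         [sa * sb, ca, -sa * cb],
         [-ca * sb, sa, ca * cb]]
    k1 = (f1, f1, 1.0)
    k0_inv = (1.0 / f0, 1.0 / f0, 1.0)
    return [[scale * k1[i] * r[i][j] * k0_inv[j] for j in range(3)]
            for i in range(3)]


h10 = synthetic_rotation_homography(500.0, 650.0, math.radians(10),
                                    math.radians(20), scale=3.7)
f0_estimates = focal_estimates_from_homography(h10)
```

Both formulas are invariant to the overall scale of the homography, which is why the synthetic example can multiply it by an arbitrary factor.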

The quality of the intrinsic camera parameters can be greatly increased by constructing 
a full 360° panorama, as mis-estimating the focal length will result in a gap (or excessive 
overlap) when the first image in the sequence is stitched to itself (Figure 8.6). The resulting 
misalignment can be used to improve the estimate of the focal length and to re-adjust the 
rotation estimates, as described in Section 8.2.4. Rotating the camera by 90° around its optical 
axis and re-shooting the panorama is a good way to check for aspect ratio and skew pixel 
problems, as is generating a full hemi-spherical panorama when there is sufficient texture. 

Ultimately, however, the most accurate estimate of the calibration parameters (including
radial distortion) can be obtained using a full simultaneous non-linear minimization of the
intrinsic and extrinsic (rotation) parameters, as described in Section 11.2.2.


11.1.4 Radial distortion 


When images are taken with wide-angle lenses, it is often necessary to model lens distor- 
tions such as radial distortion. As discussed in Section 2.1.5, the radial distortion model says 
that coordinates in the observed images are displaced towards (barrel distortion) or away 
(pincushion distortion) from the image center by an amount proportional to their radial dis- 
tance (Figure 2.13a—b). The simplest radial distortion models use low-order polynomials (c.f. 
Equation (2.78)), 


$$
\hat{x} = x (1 + \kappa_1 r^2 + \kappa_2 r^4), \qquad
\hat{y} = y (1 + \kappa_1 r^2 + \kappa_2 r^4), \tag{11.10}
$$

where $(x, y) = (0, 0)$ at the radial distortion center (2.77), $r^2 = x^2 + y^2$, and $\kappa_1$ and $\kappa_2$ are
called the radial distortion parameters (Brown 1971; Slama 1980).¹

¹ Sometimes the relationship between $x$ and $\hat{x}$ is expressed the other way around, i.e., using primed (final)
coordinates on the right-hand side, $x = \hat{x}(1 + \kappa_1 \hat{r}^2 + \kappa_2 \hat{r}^4)$. This is convenient if we map image pixels
into (warped) rays and then undistort the rays to obtain 3D rays in space, i.e., if we are using inverse warping.

A variety of techniques can be used to estimate the radial distortion parameters for a
given lens, if the digital camera has not already done this in its capture software. One of the
simplest and most useful is to take an image of a scene with a lot of straight lines, especially
lines aligned with and near the edges of the image. The radial distortion parameters can
then be adjusted until all of the lines in the image are straight, which is commonly called
the plumb-line method (Brown 1971; Kang 2001; El-Melegy and Farag 2003). Exercise 11.5
gives some more details on how to implement such a technique.

Another approach is to use several overlapping images and to combine the estimation
of the radial distortion parameters with the image alignment process, i.e., by extending the
pipeline used for stitching in Section 8.3.1. Sawhney and Kumar (1999) use a hierarchy
of motion models (translation, affine, projective) in a coarse-to-fine strategy coupled with
a quadratic radial distortion correction term. They use direct (intensity-based) minimization
to compute the alignment. Stein (1997) uses a feature-based approach combined with
a general 3D motion model (and quadratic radial distortion), which requires more matches
than a parallax-free rotational panorama but is potentially more general. More recent
approaches sometimes simultaneously compute both the unknown intrinsic parameters and the
radial distortion coefficients, which may include higher-order terms or more complex rational
or non-parametric forms (Claus and Fitzgibbon 2005; Sturm 2005; Thirthala and Pollefeys
2005; Barreto and Daniilidis 2005; Hartley and Kang 2005; Steele and Jaynes 2006; Tardif,
Sturm et al. 2009).

When a known calibration target is being used (Figure 11.2), the radial distortion estimation
can be folded into the estimation of the other intrinsic and extrinsic parameters (Zhang
2000; Hartley and Kang 2007; Tardif, Sturm et al. 2009). This can be viewed as adding
another stage to the general non-linear minimization pipeline shown in Figure 11.7 between
the intrinsic parameter multiplication box $f_C$ and the perspective division box $f_P$. (See
Exercise 11.6 for more details on the case of a planar calibration target.)

Of course, as discussed in Section 2.1.5, more general models of lens distortion, such as
fisheye and non-central projection, may sometimes be required. While the parameterization
of such lenses may be more complicated (Section 2.1.5), the general approach of either using
calibration rigs with known 3D positions or self-calibration through the use of multiple
overlapping images of a scene can both be used (Hartley and Kang 2007; Tardif, Sturm, and
Roy 2007). The same techniques used to calibrate for radial distortion can also be used to
reduce the amount of chromatic aberration by separately calibrating each color channel and
then warping the channels to put them back into alignment (Exercise 11.7).
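As a small illustration of the low-order polynomial model (11.10), the sketch below applies the distortion and then inverts it numerically. The fixed-point inversion is a commonly used approach rather than something prescribed in this section; the function names are hypothetical, and the coordinates are assumed to be measured relative to the distortion center and scaled so that the distortion is mild.

```python
def distort(x, y, k1, k2):
    """Apply x_hat = x (1 + k1 r^2 + k2 r^4), and likewise for y."""
    r2 = x * x + y * y
    s = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * s, y * s


def undistort(x_hat, y_hat, k1, k2, iterations=20):
    """Invert the model by fixed-point iteration.

    Starts from the distorted point and repeatedly divides by the
    distortion factor evaluated at the current estimate. Convergence is
    only expected for mild distortion.
    """
    x, y = x_hat, y_hat
    for _ in range(iterations):
        r2 = x * x + y * y
        s = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = x_hat / s, y_hat / s
    return x, y


k1, k2 = -0.2, 0.05                     # illustrative barrel distortion
xd, yd = distort(0.3, -0.4, k1, k2)
xu, yu = undistort(xd, yd, k1, k2)      # should recover (0.3, -0.4)
```

A plumb-line calibration could wrap `undistort` in an optimizer that adjusts `k1` and `k2` until points sampled along scene lines become collinear.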
11.2 Pose estimation


A particular instance of feature-based alignment, which occurs very often, is estimating an
object’s 3D pose from a set of 2D point projections. This pose estimation problem is also
known as extrinsic calibration, as opposed to the intrinsic calibration of internal camera
parameters such as focal length, which we discuss in Section 11.1. The problem of recovering
pose from three correspondences, which is the minimal amount of information necessary,
is known as the perspective-3-point-problem (P3P),² with extensions to larger numbers of
points collectively known as PnP (Haralick, Lee et al. 1994; Quan and Lan 1999; Gao, Hou
et al. 2003; Moreno-Noguer, Lepetit, and Fua 2007; Persson and Nordberg 2018).

In this section, we look at some of the techniques that have been developed to solve such
problems, starting with the direct linear transform (DLT), which recovers a 3 × 4 camera
matrix, followed by other “linear” algorithms, and then looking at statistically optimal iterative
algorithms.


11.2.1 Linear algorithms 


The simplest way to recover the pose of the camera is to form a set of rational linear equations
analogous to those used for 2D motion estimation (8.19) from the camera matrix form of
perspective projection (2.55–2.56),

$$
x_i = \frac{p_{00} X_i + p_{01} Y_i + p_{02} Z_i + p_{03}}{p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}} \tag{11.11}
$$

$$
y_i = \frac{p_{10} X_i + p_{11} Y_i + p_{12} Z_i + p_{13}}{p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}}, \tag{11.12}
$$

where $(x_i, y_i)$ are the measured 2D feature locations and $(X_i, Y_i, Z_i)$ are the known 3D
feature locations (Figure 11.6). As with (8.21), this system of equations can be solved in a
linear fashion for the unknowns in the camera matrix P by multiplying the denominator on
both sides of the equation. Because P is unknown up to a scale, we can either fix one of the
entries, e.g., $p_{23} = 1$, or find the singular vector of the set of linear equations that is
associated with the smallest singular value. The
resulting algorithm is called the direct linear transform (DLT) and is commonly attributed 
to Sutherland (1974). (For a more in-depth discussion, see Hartley and Zisserman (2004).) 
To compute the 12 (or 11) unknowns in P, at least six correspondences between 3D and 2D 
locations must be known. 
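A minimal NumPy sketch of the DLT is shown here. It is not a production implementation: the function names `dlt_camera_matrix` and `project` are hypothetical, the 3D points are assumed to be in a non-degenerate (e.g., non-coplanar) configuration, and no coordinate normalization or outlier handling is performed, both of which would usually be advisable with real data.

```python
import numpy as np


def dlt_camera_matrix(points_3d, points_2d):
    """Estimate the 3x4 camera matrix P (up to scale) from n >= 6 matches."""
    rows = []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        Xh = [X, Y, Z, 1.0]
        # Multiply (11.11) and (11.12) through by the denominator:
        #   p0 . Xh - x (p2 . Xh) = 0   and   p1 . Xh - y (p2 . Xh) = 0
        rows.append(Xh + [0.0] * 4 + [-x * c for c in Xh])
        rows.append([0.0] * 4 + Xh + [-y * c for c in Xh])
    A = np.asarray(rows, dtype=float)
    # The right singular vector with the smallest singular value
    # minimizes |A p| subject to |p| = 1.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)


def project(p, points_3d):
    """Project 3D points with camera matrix p using (11.11) and (11.12)."""
    pts = np.asarray(points_3d, dtype=float)
    xh = np.hstack([pts, np.ones((len(pts), 1))]) @ p.T
    return xh[:, :2] / xh[:, 2:3]
```

A typical use is `P = dlt_camera_matrix(pts3d, pts2d)`, followed by a check that `project(P, pts3d)` reproduces `pts2d` to within the expected measurement noise.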

As with the case of estimating homographies (8.21–8.23), more accurate results for the
entries in P can be obtained by directly minimizing the set of Equations (11.11–11.12) using
non-linear least squares with a small number of iterations. Note that instead of taking the
ratios of the X/Z and Y/Z values as in (11.11–11.12), it is also possible to take a cross
product of the 3-vector $(x_i, y_i, 1)$ image measurement and the 3D ray $(X, Y, Z)$ and set the
three elements of this cross-product to 0. The resulting three equations, when interpreted as
a set of least squares constraints, in effect compute the squared sine of the angle between the
two rays.

² The “3-point” algorithms actually require a 4th point to resolve a 4-way ambiguity.

Figure 11.6 Pose estimation by the direct linear transform and by measuring visual angles
and distances between pairs of points.
Once the entries in P have been recovered, it is possible to recover both the intrinsic 
calibration matrix K and the rigid transformation (R, t) by observing from Equation (2.56) 
that 


$$
\mathbf{P} = \mathbf{K} [\mathbf{R} \mid \mathbf{t}]. \tag{11.13}
$$


Because K is upper-triangular (see the discussion in Section 2.1.4), both K and R can be
obtained from the front 3 × 3 sub-matrix of P using RQ factorization (Golub and Van Loan
1996).³
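The decomposition in (11.13) can be sketched with NumPy as follows. The helper names `rq` and `decompose_camera_matrix` are hypothetical; the RQ factorization is built here from NumPy's QR factorization by reversing rows and columns, which is one common construction rather than the only one. The sketch assumes a non-singular front 3 × 3 sub-matrix and resolves the sign and scale ambiguities by requiring a positive diagonal for K, a proper rotation, and a unit bottom-right entry in K.

```python
import numpy as np


def rq(m):
    """RQ factorization m = r @ q (r upper-triangular, q orthogonal)."""
    flip = np.flipud(np.eye(3))            # exchange (row-reversal) matrix
    q_t, r_t = np.linalg.qr((flip @ m).T)  # QR of the flipped transpose
    r = flip @ r_t.T @ flip                # lower-triangular -> upper-triangular
    q = flip @ q_t.T
    return r, q


def decompose_camera_matrix(p):
    """Split a 3x4 camera matrix p ~ K [R | t] into K, R, and t."""
    k, r = rq(p[:, :3])
    # Make the diagonal of K positive; the sign matrix is its own inverse.
    signs = np.diag(np.sign(np.diag(k)))
    k, r = k @ signs, signs @ r
    t = np.linalg.solve(k, p[:, 3])
    # p is only known up to scale, including sign: enforce det(R) = +1.
    if np.linalg.det(r) < 0:
        r, t = -r, -t
    return k / k[2, 2], r, t
```

In practice, the K recovered this way is usually only an initial estimate, since prior knowledge about the intrinsics is better imposed in a subsequent non-linear refinement.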

In most applications, however, we have some prior knowledge about the intrinsic calibra- 
tion matrix K, e.g., that the pixels are square, the skew is very small, and the image center is 
near the geometric center of the image (2.57-2.59). Such constraints can be incorporated into 
a non-linear minimization of the parameters in K and (R, t), as described in Section 11.2.2. 

³ Note the unfortunate clash of terminologies: In matrix algebra textbooks, R represents an upper-triangular
matrix; in computer vision, R is an orthogonal rotation.

In the case where the camera is already calibrated, i.e., the matrix K is known
(Section 11.1), we can perform pose estimation using as few as three points (Fischler and Bolles
1981; Haralick, Lee et al. 1994; Quan and Lan 1999). The basic observation that these linear
PnP (perspective n-point) algorithms employ is that the visual angle between any pair of 2D
points $\hat{\mathbf{x}}_i$ and $\hat{\mathbf{x}}_j$ must be the same as the angle between their corresponding 3D points $\mathbf{p}_i$ and
$\mathbf{p}_j$ (Figure 11.6).

A full derivation of this approach can be found in the first edition of this book (Szeliski 
2010, Section 6.2.1) and also in (Quan and Lan 1999), where the authors provide accuracy 
results for this and other techniques, which use fewer points but require more complicated 
algebraic manipulations. The paper by Moreno-Noguer, Lepetit, and Fua (2007) reviews 
other alternatives and also gives a lower complexity algorithm that typically produces more 
accurate results. An even more recent paper by Terzakis and Lourakis (2020) reviews papers 
published in the last decade. 

Unfortunately, because minimal PnP solutions can be quite noise sensitive and also suffer 
from bas-relief ambiguities (e.g., depth reversals) (Section 11.4.5), it is prudent to optimize 
the initial estimates from PnP using the iterative technique described in Section 11.2.2. An 
alternative pose estimation algorithm involves starting with a scaled orthographic projection 
model and then iteratively refining this initial estimate using a more accurate perspective 
projection model (DeMenthon and Davis 1995). The attraction of this model, as stated in the
paper’s title, is that it can be implemented “in 25 lines of [Mathematica] code”.


CNN-based pose estimation 


As with other areas of computer vision, deep neural networks have also been applied to pose
estimation. Some representative papers include Xiang, Schmidt et al. (2018), Oberweger,
Rad, and Lepetit (2018), Hu, Hugonot et al. (2019), Peng, Liu et al. (2019), and Hu, Fua
et al. (2020) for object pose estimation, and papers such as Kendall and Cipolla (2017) and
Kim, Dunn, and Frahm (2017), discussed in Section 11.2.3 on location recognition. There
is also a very active community around estimating pose from RGB-D images, with the most
recent papers (Hagelskjær and Buch 2020; Labbé, Carpentier et al. 2020) evaluated on the
BOP (Benchmark for 6DOF Object Pose) (Hodan, Michel et al. 2018).⁴


11.2.2 Iterative non-linear algorithms 


The most accurate and flexible way to estimate pose is to directly minimize the squared (or
robust) reprojection error for the 2D points as a function of the unknown pose parameters in
(R, t) and optionally K using non-linear least squares (Tsai 1987; Bogart 1991; Gleicher and
Witkin 1992). We can write the projection equations as

$$x_i = f(p_i; R, t, K) \tag{11.14}$$

and iteratively minimize the robustified linearized reprojection errors

$$E_{\mathrm{NLP}} = \sum_i \rho\left( \frac{\partial f}{\partial R} \Delta R + \frac{\partial f}{\partial t} \Delta t + \frac{\partial f}{\partial K} \Delta K - r_i \right), \tag{11.15}$$

where $r_i = \tilde{x}_i - \hat{x}_i$ is the current residual vector (2D error in predicted position) and the
partial derivatives are with respect to the unknown pose parameters (rotation, translation, and
optionally calibration). The robust loss function $\rho$, which we first introduced in (4.15) in
Section 4.1.3, is used to reduce the influence of outlier correspondences. Note that if full 2D
covariance estimates are available for the 2D feature locations, the above squared norm can
be weighted by the inverse point covariance matrix, as in Equation (8.11).

⁴ https://bop.felk.cvut.cz/challenges/bop-challenge-2020, https://cmp.felk.cvut.cz/sixd/workshop-2020

Figure 11.7 A set of chained transforms for projecting a 3D point $p_i$ to a 2D measurement
$x_i$ through a series of transformations $f^{(k)}$, each of which is controlled by its own set of
parameters. The dashed lines indicate the flow of information as partial derivatives are
computed during a backward pass.

An easier to understand (and implement) version of the above non-linear regression problem
can be constructed by re-writing the projection equations as a concatenation of simpler
steps, each of which transforms a 4D homogeneous coordinate $p_i$ by a simple transformation
such as translation, rotation, or perspective division (Figure 11.7). The resulting projection
equations can be written as

$$y^{(1)} = f_T(p_i; c_j) = p_i - c_j, \tag{11.16}$$
$$y^{(2)} = f_R(y^{(1)}; q_j) = R(q_j)\, y^{(1)}, \tag{11.17}$$
$$y^{(3)} = f_P(y^{(2)}) = \frac{y^{(2)}}{z^{(2)}}, \tag{11.18}$$
$$x_i = f_C(y^{(3)}; k) = K(k)\, y^{(3)}. \tag{11.19}$$


Note that in these equations, we have indexed the camera centers $c_j$ and camera rotation
quaternions $q_j$ by an index j, in case more than one pose of the calibration object is being
used (see also Section 11.4.2). We are also using the camera center $c_j$ instead of the world
translation $t_j$, as this is a more natural parameter to estimate.

The advantage of this chained set of transformations is that each one has a simple partial
derivative with respect both to its parameters and to its input. Thus, once the predicted value
of $\hat{x}_i$ has been computed based on the 3D point location $p_i$ and the current values of the pose
parameters $(c_j, q_j, k)$, we can obtain all of the required partial derivatives using the chain
rule

$$\frac{\partial r_i}{\partial p^{(k)}} = \frac{\partial r_i}{\partial y^{(k)}} \frac{\partial y^{(k)}}{\partial p^{(k)}}, \tag{11.20}$$

where $p^{(k)}$ indicates one of the parameter vectors that is being optimized. (This same “trick”
is used in neural networks as part of backpropagation, which we presented in Figure 5.31.)

The one special case in this formulation that can be considerably simplified is the
computation of the rotation update. Instead of directly computing the derivatives of the 3 × 3
rotation matrix R(q) as a function of the unit quaternion entries, you can prepend the
incremental rotation matrix $\Delta R(\omega)$ given in Equation (2.35) to the current rotation matrix and
compute the partial derivative of the transform with respect to these parameters, which
results in a simple cross product of the backward chaining partial derivative and the outgoing
3D vector, as explained in Equation (2.36).
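To make the chained formulation concrete, here is a minimal Gauss-Newton sketch in NumPy. Several choices are our own assumptions rather than part of the text: the function names (`skew`, `rodrigues`, `project`, `refine_pose`) are hypothetical, the calibration K is held fixed with a last row of (0, 0, 1), the current rotation is stored as a matrix instead of a quaternion, a plain squared loss replaces the robust loss $\rho$, and the incremental rotation is applied with the exact Rodrigues formula rather than its linearized form.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def rodrigues(w):
    """Rotation matrix for the rotation vector w (axis times angle)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    n = skew(w / theta)
    return np.eye(3) + np.sin(theta) * n + (1 - np.cos(theta)) * (n @ n)

def project(p, R, c, K):
    """Chained projection (11.16-11.19) of N x 3 points; also returns y2."""
    y1 = p - c                 # translation by the camera center
    y2 = y1 @ R.T              # rotation into the camera frame
    y3 = y2 / y2[:, 2:3]       # perspective division
    x = y3 @ K.T               # calibration
    return x[:, :2], y2

def refine_pose(p, x_obs, R, c, K, iters=15):
    """Gauss-Newton refinement of the rotation R and center c, with K fixed."""
    for _ in range(iters):
        x_pred, y2 = project(p, R, c, K)
        r = (x_obs - x_pred).ravel()            # residuals: measured - predicted
        J = np.zeros((2 * len(p), 6))
        for i, (X, Y, Z) in enumerate(y2):
            # Derivative of the 2D point with respect to y2 (division, then K)
            dx_dy2 = K[:2, :2] @ np.array([[1 / Z, 0, -X / Z**2],
                                           [0, 1 / Z, -Y / Z**2]])
            J[2 * i:2 * i + 2, :3] = -dx_dy2 @ skew(y2[i])  # incremental rotation
            J[2 * i:2 * i + 2, 3:] = -dx_dy2 @ R            # camera center
        delta = np.linalg.lstsq(J, r, rcond=None)[0]
        R = rodrigues(delta[:3]) @ R            # prepend the incremental rotation
        c = c + delta[3:]
    return R, c
```

The rotation columns of the Jacobian are the cross-product form mentioned above: prepending a small rotation $\omega$ changes the rotated point $y^{(2)}$ by $\omega \times y^{(2)} = -[y^{(2)}]_\times \omega$.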


Target-based augmented reality 


A widely used application of pose estimation is augmented reality, where virtual 3D images 
or annotations are superimposed on top of a live video feed, either through the use of see- 
through glasses (a head-mounted display) or on a regular computer or mobile device screen 
(Azuma, Baillot et al. 2001; Haller, Billinghurst, and Thomas 2007; Billinghurst, Clark, and 
Lee 2015). In some applications, a special pattern printed on cards or in a book is tracked to 
perform the augmentation (Kato, Billinghurst et al. 2000; Billinghurst, Kato, and Poupyrev 
2001). For a desktop application, a grid of dots printed on a mouse pad can be tracked by 
a camera embedded in an augmented mouse to give the user control of a full six degrees of 
freedom over their position and orientation in a 3D space (Hinckley, Sinclair et al. 1999). 
Today, tracking known targets such as movie posters is used in some phone-based augmented
reality systems such as Facebook’s Spark AR.⁵

Sometimes, the scene itself provides a convenient object to track, such as the rectangle
defining a desktop used in through-the-lens camera control (Gleicher and Witkin 1992). In
outdoor locations, such as film sets, it is more common to place special markers such as
brightly colored balls in the scene to make it easier to find and track them (Bogart 1991). In
older applications, surveying techniques were used to determine the locations of these balls
before filming. Today, it is more common to apply structure-from-motion directly to the film
footage itself (Section 11.5.2).

Exercise 8.4 has you implement a tracking and pose estimation system for augmented-reality
applications.

⁵ https://sparkar.facebook.com/ar-studio


11.2.3 Application: Location recognition 


One of the most exciting applications of pose estimation is in the area of location recognition, 
which can be used both in desktop applications (“Where did I take this holiday snap?”) and 
in mobile smartphone applications. The latter case includes not only finding out your current 
location based on a cell-phone image, but also providing you with navigation directions or 
annotating your images with useful information, such as building names and restaurant
reviews (i.e., a pocketable form of augmented reality). This problem is also often called visual
(or image-based) localization (Se, Lowe, and Little 2002; Zhang and Kosecka 2006; Janai,
Güney et al. 2020, Section 13.3) or visual place recognition (Lowry, Sünderhauf et al. 2015).

Some approaches to location recognition assume that the photos consist of architectural 
scenes for which vanishing directions can be used to pre-rectify the images for easier match- 
ing (Robertson and Cipolla 2004). Other approaches use general affine covariant interest 
points to perform wide baseline matching (Schaffalitzky and Zisserman 2002), with the win- 
ning entry on the ICCV 2005 Computer Vision Contest (Szeliski 2005) using this approach 
(Zhang and Kosecka 2006). The Photo Tourism system of Snavely, Seitz, and Szeliski (2006) 
(Section 14.1.2) was the first to apply these kinds of ideas to large-scale image matching and 
(implicit) location recognition from internet photo collections taken under a wide variety of 
viewing conditions. 

The main difficulty in location recognition is in dealing with the extremely large
community (user-generated) photo collections on websites such as Flickr (Philbin, Chum et al.
2007; Chum, Philbin et al. 2007; Philbin, Chum et al. 2008; Irschara, Zach et al. 2009;
Turcot and Lowe 2009; Sattler, Leibe, and Kobbelt 2011, 2017) or commercially captured
databases (Schindler, Brown, and Szeliski 2007; Klingner, Martin, and Roseborough 2013;
Torii, Arandjelović et al. 2018). The prevalence of commonly appearing elements such as
foliage, signs, and common architectural elements further complicates the task (Schindler,
Brown, and Szeliski 2007; Jégou, Douze, and Schmid 2009; Chum and Matas 2010b; Knopp,
Sivic, and Pajdla 2010; Torii, Sivic et al. 2013; Sattler, Havlena et al. 2016). Figure 7.26
shows some results on location recognition from community photo collections, while
Figure 11.8 shows sample results from denser commercially acquired datasets. In the latter
case, the overlap between adjacent database images can be used to verify and prune potential
matches using “temporal” filtering, i.e., requiring the query image to match nearby overlapping
database images before accepting the match. Similar ideas have been used to improve
location recognition from panoramic video sequences (Levin and Szeliski 2004; Samano,
Zhou, and Calway 2020) and to combine local SLAM reconstructions from image sequences
with matching against a precomputed map for higher reliability (Stenborg, Sattler, and
Hammarstrand 2020). Recognizing indoor locations inside buildings and shopping malls poses its
own set of challenges, including textureless areas and repeated elements (Levin and Szeliski
2004; Wang, Fidler, and Urtasun 2015; Sun, Xie et al. 2017; Taira, Okutomi et al. 2018; Taira,
Rocco et al. 2019; Lee, Ryu et al. 2021). The matching of ground-level to aerial images has
also been studied (Kaminsky, Snavely et al. 2009; Shan, Wu et al. 2014).

Figure 11.8 Feature-based location recognition (Schindler, Brown, and Szeliski 2007) ©
2007 IEEE: (a) three typical series of overlapping street photos; (b) handheld camera shots
and (c) their corresponding database photos.

Some of the initial research on location recognition was organized around the Oxford 5k
and Paris 6k datasets (Philbin, Chum et al. 2007, 2008; Radenović, Iscen et al. 2018), as well
as the Vienna (Irschara, Zach et al. 2009) and Photo Tourism (Li, Snavely, and Huttenlocher
2010) datasets, and later around the 7 scenes indoor RGB-D dataset (Shotton, Glocker et al.
2013) and Cambridge Landmarks (Kendall, Grimes, and Cipolla 2015). The NetVLAD paper
(Arandjelovic, Gronat et al. 2016) was tested on Google Street View Time Machine data.
Currently, the most widely used visual localization datasets are collected at the Long-Term Visual
Localization Benchmark⁶ and include such datasets as Aachen Day-Night (Sattler, Maddern
et al. 2018) and InLoc (Taira, Okutomi et al. 2018). And while most localization systems
work from collections of ground-level images, it is also possible to re-localize based on
textured digital elevation (terrain) models for outdoor (non-city) applications (Baatz, Saurer et
al. 2012; Brejcha, Lukáč et al. 2020).

⁶ https://www.visuallocalization.net


Some of the most recent approaches to localization use deep networks to generate feature 
descriptors (Arandjelovic, Gronat et al. 2016; Kim, Dunn, and Frahm 2017; Torii,
Arandjelović et al. 2018; Radenović, Tolias, and Chum 2019; Yang, Kien Nguyen et al. 2019;
Sarlin, Unagar et al. 2021), perform large-scale instance retrieval (Radenović, Tolias, and
Chum 2019; Cao, Araujo, and Sim 2020; Ng, Balntas et al. 2020; Tolias, Jenicek, and Chum
2020; Pion, Humenberger et al. 2020 and Section 6.2.3), map images to 3D scene coordinates 
(Brachmann and Rother 2018), or perform end-to-end scene coordinate regression (Shotton, 
Glocker et al. 2013), absolute pose regression (APR) (Kendall, Grimes, and Cipolla 2015; 
Kendall and Cipolla 2017), or relative pose regression (RPR) (Melekhov, Ylioinas et al. 
2017; Balntas, Li, and Prisacariu 2018). Recent evaluations of these techniques have shown 
that classical approaches based on feature matching followed by geometric pose optimiza- 
tion typically outperform pose regression approaches in terms of accuracy and generalization 
(Sattler, Zhou et al. 2019; Zhou, Sattler et al. 2019; Ding, Wang et al. 2019; Lee, Ryu et al. 
2021; Sarlin, Unagar et al. 2021). 


The Long-Term Visual Localization benchmark has a leaderboard listing the best-performing 
localization systems. In the CVPR 2020 workshop and challenge, some of the winning en- 
tries were based on recent detectors, descriptors, and matchers such as SuperGlue (Sarlin, 
DeTone et al. 2020), ASLFeat (Luo, Zhou et al. 2020), and R2D2 (Revaud, Weinzaepfel et 
al. 2019). Other systems that did well include HF-Net (Sarlin, Cadena et al. 2019), ONavi 
(Fan, Zhou et al. 2020), and D2-Net (Dusmanu, Rocco et al. 2019). An even more recent 
trend is to use DNNs or transformers to establish dense coarse-to-fine matches (Jiang, Trulls 
et al. 2021; Sun, Shen et al. 2021). 


Another variant on location recognition is the automatic discovery of landmarks, i.e.,
frequently photographed objects and locations. Simon, Snavely, and Seitz (2007) show how
these kinds of objects can be discovered simply by analyzing the matching graph constructed 
as part of the 3D modeling process in Photo Tourism. More recent work has extended this ap- 
proach to larger datasets using efficient clustering techniques (Philbin and Zisserman 2008; 
Li, Wu et al. 2008; Chum, Philbin, and Zisserman 2008; Chum and Matas 2010a;
Arandjelović and Zisserman 2012), combining meta-data such as GPS and textual tags with visual
search (Quack, Leibe, and Van Gool 2008; Crandall, Backstrom et al. 2009; Li, Snavely et al. 
2012), and using multiple descriptors to obtain real-time performance in micro aerial vehicle 
navigation (Lim, Sinha et al. 2012). It is now even possible to automatically associate object 
tags with images based on their co-occurrence in multiple loosely tagged images (Simon and 
Seitz 2008; Gammeter, Bossard et al. 2009). 


The concept of organizing the world’s photo collections by location has even been
recently extended to organizing all of the universe’s (astronomical) photos in an application
called astrometry.⁷ The technique used to match any two star fields is to take quadruplets of
nearby stars (a pair of stars and another pair inside their diameter) to form a 30-bit geometric
hash by encoding the relative positions of the second pair of points using the inscribed square
as the reference frame, as shown in Figure 11.9. Traditional information retrieval techniques
(k-d trees built for different parts of a sky atlas) are then used to find matching quads as
potential star field location hypotheses, which can then be verified using a similarity transform.

Figure 11.9 Locating star fields using astrometry, https://astrometry.net. (a) Input star
field and some selected star quads. (b) The 2D coordinates of stars C and D are encoded
relative to the unit square defined by A and B.
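The quad encoding can be illustrated with a short sketch. We assume, following the caption of Figure 11.9, that the reference frame maps star A to (0, 0) and star B to (1, 1); the function name `quad_code` is hypothetical, and the quantization of the four resulting coordinates into a 30-bit hash is not shown because its details are not given here.

```python
import numpy as np

def quad_code(A, B, C, D):
    """Coordinates of C and D in the frame where A -> (0, 0) and B -> (1, 1)."""
    a, b, c, d = (complex(*pt) for pt in (A, B, C, D))
    frame = (1 + 1j) / (b - a)      # similarity that maps b - a to 1 + 1j
    zc, zd = (c - a) * frame, (d - a) * frame
    return np.array([zc.real, zc.imag, zd.real, zd.imag])
```

Treating 2D points as complex numbers makes the frame change a single multiplication, and the resulting code is unchanged when the whole quad is translated, rotated, or uniformly scaled, which is what allows two star fields to be matched before a similarity transform is verified.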


11.2.4 Triangulation 


The problem of determining a point’s 3D position from a set of corresponding image locations 
and known camera positions is known as triangulation. This problem is the converse of the 
pose estimation problem we studied in Section 11.2. 

One of the simplest ways to solve this problem is to find the 3D point p that lies closest
to all of the 3D rays corresponding to the 2D matching feature locations $\{x_j\}$ observed by
cameras $\{P_j = K_j [R_j \mid t_j]\}$, where $t_j = -R_j c_j$ and $c_j$ is the jth camera center (2.55-2.56).
As you can see in Figure 11.10, these rays originate at $c_j$ in a direction $\hat{v}_j = N(R_j^{-1} K_j^{-1} x_j)$,
where $N(v)$ normalizes a vector v to unit length. The nearest point to p on this ray, which
we denote as $q_j = c_j + d_j \hat{v}_j$, minimizes the distance

$$\|q_j - p\|^2 = \|c_j + d_j \hat{v}_j - p\|^2, \tag{11.21}$$

which has a minimum at $d_j = \hat{v}_j \cdot (p - c_j)$. Hence,

$$q_j = c_j + (\hat{v}_j \hat{v}_j^T)(p - c_j) = c_j + (p - c_j)_{\parallel} \tag{11.22}$$

in the notation of Equation (2.29), and the squared distance between p and $q_j$ is

$$r_j^2 = \|(I - \hat{v}_j \hat{v}_j^T)(p - c_j)\|^2 = \|(p - c_j)_{\perp}\|^2. \tag{11.23}$$

The optimal value for p, which lies closest to all of the rays, can be computed as a regular
least squares problem by summing over all the $r_j^2$ and finding the optimal value of p,

$$p = \Big[ \sum_j (I - \hat{v}_j \hat{v}_j^T) \Big]^{-1} \Big[ \sum_j (I - \hat{v}_j \hat{v}_j^T)\, c_j \Big]. \tag{11.24}$$

⁷ https://astrometry.net

Figure 11.10 3D point triangulation by finding the point p that lies nearest to all of the
optical rays $c_j + d_j \hat{v}_j$.
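The closed-form solution (11.24) translates almost directly into code. The following is a minimal NumPy sketch; the function name `triangulate_rays` is hypothetical, and the ray directions are assumed to have already been computed from the pixel locations and camera parameters.

```python
import numpy as np

def triangulate_rays(centers, directions):
    """Point closest (in least squares) to the rays c_j + d_j v_j, as in (11.24)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, v in zip(centers, directions):
        v = v / np.linalg.norm(v)              # N(v): normalize to unit length
        proj = np.eye(3) - np.outer(v, v)      # projector perpendicular to the ray
        A += proj
        b += proj @ c
    return np.linalg.solve(A, b)
```

The 3 × 3 system becomes singular when all of the rays are parallel, in which case no unique nearest point exists.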


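An alternative formulation, which is more statistically optimal and which can produce
significantly better estimates if some of the cameras are closer to the 3D point than others, is
to minimize the residual in the measurement equations

$$x_j = \frac{p_{00}^{(j)} X + p_{01}^{(j)} Y + p_{02}^{(j)} Z + p_{03}^{(j)} W}{p_{20}^{(j)} X + p_{21}^{(j)} Y + p_{22}^{(j)} Z + p_{23}^{(j)} W} \tag{11.25}$$

$$y_j = \frac{p_{10}^{(j)} X + p_{11}^{(j)} Y + p_{12}^{(j)} Z + p_{13}^{(j)} W}{p_{20}^{(j)} X + p_{21}^{(j)} Y + p_{22}^{(j)} Z + p_{23}^{(j)} W} \tag{11.26}$$

where $(x_j, y_j)$ are the measured 2D feature locations and $\{p_{00}^{(j)} \ldots p_{23}^{(j)}\}$ are the known entries
in camera matrix $P_j$ (Sutherland 1974).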

As with Equations (8.21, 11.11, and 11.12), this set of non-linear equations can be
converted into a linear least squares problem by multiplying both sides by the denominator, again
resulting in the direct linear transform (DLT) formulation. Note that if we use homogeneous
coordinates p = (X, Y, Z, W), the resulting set of equations is homogeneous and is
best solved as a singular value decomposition (SVD) or eigenvalue problem (looking for the
singular vector or eigenvector associated with the smallest singular value or eigenvalue). If we
set W = 1, we can use regular linear least squares, but the resulting system may be singular or
poorly conditioned, e.g., if all of the viewing rays are parallel, as occurs for points far away
from the camera.

For this reason, it is generally preferable to parameterize 3D points using homogeneous
coordinates, especially if we know that there are likely to be points at greatly varying
distances from the cameras. Of course, minimizing the set of observations (11.25-11.26) using
non-linear least squares, as described in (8.14 and 8.23), is preferable to using linear least
squares, regardless of the representation chosen.
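The homogeneous DLT formulation can be sketched as follows. This is a minimal NumPy illustration with a hypothetical function name `triangulate_dlt`; each measurement contributes the two rows obtained by multiplying (11.25) and (11.26) by their common denominator, and the solution is the right singular vector associated with the smallest singular value.

```python
import numpy as np

def triangulate_dlt(Ps, xs):
    """Homogeneous DLT triangulation from 3x4 camera matrices and 2D points."""
    rows = []
    for P, (x, y) in zip(Ps, xs):
        rows.append(x * P[2] - P[0])   # (11.25) multiplied by its denominator
        rows.append(y * P[2] - P[1])   # (11.26) multiplied by its denominator
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1]                      # (X, Y, Z, W), defined up to scale
```

Because the result is left in homogeneous form, a point whose rays are nearly parallel simply comes back with W close to 0 instead of causing a poorly conditioned solve.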

For the case of two observations, it turns out that the location of the point p that exactly
minimizes the true reprojection error (11.25-11.26) can be computed by solving a
degree six polynomial equation (Hartley and Sturm 1997). Another problem to watch out
for with triangulation is the issue of cheirality, i.e., ensuring that the reconstructed points lie
in front of all the cameras (Hartley 1998). While this cannot always be guaranteed, a useful
heuristic is to take the points that lie behind the cameras because their rays are diverging
(imagine Figure 11.10 where the rays were pointing away from each other) and to place them
on the plane at infinity by setting their W values to 0.


11.3 Two-frame structure from motion 


So far in our study of 3D reconstruction, we have always assumed that either the 3D point 
positions or the 3D camera poses are known in advance. In this section, we take our first 
look at structure from motion, which is the simultaneous recovery of 3D structure and pose 
from image correspondences. In particular, we examine techniques that operate on just two 
frames with point correspondences. We divide this section into the study of classic “n- 
point” algorithms, special (degenerate) cases, projective (uncalibrated) reconstruction, and 
self-calibration for cameras whose intrinsic calibrations are unknown. 


11.3.1 Eight, seven, and five-point algorithms 


Consider Figure 11.11, which shows a 3D point p being viewed from two cameras whose
relative position can be encoded by a rotation R and a translation t. As we do not know
anything about the camera positions, without loss of generality, we can set the first camera at
the origin $c_0 = 0$ and at a canonical orientation $R_0 = I$.

The 3D point $p_0 = d_0 \hat{x}_0$ observed in the first image at location $\hat{x}_0$ and at a z distance of
$d_0$ is mapped into the second image by the transformation

$$d_1 \hat{x}_1 = p_1 = R p_0 + t = R(d_0 \hat{x}_0) + t, \tag{11.27}$$

where $\hat{x}_j = K_j^{-1} x_j$ are the (local) ray direction vectors. Taking the cross product of the two
(interchanged) sides with t in order to annihilate it on the right-hand side yields⁸

$$d_1 [t]_\times \hat{x}_1 = d_0 [t]_\times R \hat{x}_0. \tag{11.28}$$

Taking the dot product of both sides with $\hat{x}_1$ yields

$$d_0 \hat{x}_1^T ([t]_\times R) \hat{x}_0 = d_1 \hat{x}_1^T [t]_\times \hat{x}_1 = 0, \tag{11.29}$$

because the right-hand side is a triple product with two identical entries. (Another way to
say this is that the cross product matrix $[t]_\times$ is skew symmetric and returns 0 when pre- and
post-multiplied by the same vector.)

Figure 11.11 Epipolar geometry: The vectors $t = c_1 - c_0$, $p - c_0$, and $p - c_1$ are co-planar
and define the basic epipolar constraint expressed in terms of the pixel measurements $x_0$ and
$x_1$.


We therefore arrive at the basic epipolar constraint

$$\hat{x}_1^T E \hat{x}_0 = 0, \tag{11.30}$$

where

$$E = [t]_\times R \tag{11.31}$$

is called the essential matrix (Longuet-Higgins 1981).

⁸ The cross-product operator $[\,]_\times$ was introduced in (2.32).

An alternative way to derive the epipolar constraint is to notice that, for the cameras to be
oriented so that the rays $\hat{x}_0$ and $\hat{x}_1$ intersect in 3D at point p, the vector connecting the two
camera centers $c_1 - c_0 = -R^{-1} t$ and the rays corresponding to pixels $x_0$ and $x_1$, namely
$R_j^{-1} \hat{x}_j$ (with $R_0 = I$ and $R_1 = R$), must be co-planar. This requires that the triple product

$$(\hat{x}_0, R^{-1} \hat{x}_1, -R^{-1} t) = (R \hat{x}_0, \hat{x}_1, -t) = \hat{x}_1 \cdot (t \times R \hat{x}_0) = \hat{x}_1^T ([t]_\times R) \hat{x}_0 = 0. \tag{11.32}$$

Notice that the essential matrix E maps a point $\hat{x}_0$ in image 0 into a line $l_1 = E \hat{x}_0 =
[t]_\times R \hat{x}_0$ in image 1, because $\hat{x}_1^T l_1 = 0$ (Figure 11.11). All such lines must pass through the
second epipole $e_1$, which is therefore defined as the left singular vector of E with a 0 singular
value, or, equivalently, the projection of the vector t into image 1. The dual (transpose) of
these relationships gives us the epipolar line in the first image as $l_0 = E^T \hat{x}_1$ and $e_0$ as the
zero-value right singular vector of E.


Eight-point algorithm. Given this fundamental relationship (11.30), how can we use it to
recover the camera motion encoded in the essential matrix E? If we have N corresponding
measurements $\{(\hat{x}_{i0}, \hat{x}_{i1})\}$, we can form N homogeneous equations in the nine elements of
$E = \{e_{00} \ldots e_{22}\}$,

$$\begin{aligned}
x_{i0} x_{i1} e_{00} + y_{i0} x_{i1} e_{01} + x_{i1} e_{02} \; + \\
x_{i0} y_{i1} e_{10} + y_{i0} y_{i1} e_{11} + y_{i1} e_{12} \; + \\
x_{i0} e_{20} + y_{i0} e_{21} + e_{22} &= 0
\end{aligned} \tag{11.33}$$

where $x_{ij} = (x_{ij}, y_{ij}, 1)$. This can be written more compactly as

$$[x_{i1} x_{i0}^T] \otimes E = Z_i \otimes E = z_i \cdot f = 0, \tag{11.34}$$

where $\otimes$ indicates an element-wise multiplication and summation of matrix elements, and $z_i$
and f are the vectorized forms of the $Z_i = \hat{x}_{i1} \hat{x}_{i0}^T$ and E matrices.⁹ Given N ≥ 8 such
equations, we can compute an estimate (up to scale) for the entries in E using an SVD.

In the presence of noisy measurements, how close is this estimate to being statistically
optimal? If you look at the entries in (11.33), you can see that some entries are the products
of image measurements such as $x_{i0} y_{i1}$ and others are direct image measurements (or even
the identity). If the measurements have comparable noise, the terms that are products of
measurements have their noise amplified by the other element in the product, which can lead
to very poor scaling, e.g., an inordinately large influence of points with large coordinates (far
away from the image center).

To counteract this trend, Hartley (1997a) suggests that the point coordinates should be
translated and scaled so that their centroid lies at the origin and their variance is unity, i.e.,

$$\tilde{x}_i = s(x_i - \mu_x) \tag{11.35}$$
$$\tilde{y}_i = s(y_i - \mu_y) \tag{11.36}$$

such that $\sum_i \tilde{x}_i = \sum_i \tilde{y}_i = 0$ and $\sum_i \tilde{x}_i^2 + \sum_i \tilde{y}_i^2 = 2n$, where n is the number of points.¹⁰

Once the essential matrix $\tilde{E}$ has been computed from the transformed coordinates
$\{(\tilde{x}_{i0}, \tilde{x}_{i1})\}$, where $\tilde{x}_{ij} = T_j \hat{x}_{ij}$ and $T_j$ is the 3 × 3 matrix that implements the shift and
scale operations in (11.35-11.36), the original essential matrix E can be recovered as

$$E = T_1^T \tilde{E}\, T_0. \tag{11.37}$$

In his paper, Hartley (1997a) compares the improvement due to his re-normalization strategy
to alternative distance measures proposed by others such as Zhang (1998a,b) and concludes
that his simple re-normalization in most cases is as effective as (or better than) alternative
techniques. Torr and Fitzgibbon (2004) recommend a variant on this algorithm where the
norm of the upper 2 × 2 sub-matrix of E is set to 1 and show that it has even better stability
with respect to 2D coordinate transformations.

⁹ We use f instead of e to denote the vectorized form of E to avoid confusion with the epipoles $e_j$.
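The normalized eight-point procedure can be sketched in NumPy as follows. The function names `normalize` and `eight_point` are hypothetical, the inputs are assumed to be N × 2 arrays of calibrated ray coordinates with the third component equal to 1, and the sketch stops at the linear estimate: it does not project the result onto the set of valid essential matrices.

```python
import numpy as np

def normalize(x):
    """Shift and scale N x 2 points as in (11.35-11.36); returns points and T."""
    mu = x.mean(axis=0)
    s = np.sqrt(2 * len(x) / ((x - mu) ** 2).sum())
    T = np.array([[s, 0, -s * mu[0]], [0, s, -s * mu[1]], [0, 0, 1]])
    return np.column_stack([s * (x - mu), np.ones(len(x))]), T

def eight_point(x0, x1):
    """Linear estimate of E (up to scale and sign) from N >= 8 correspondences."""
    n0, T0 = normalize(x0)
    n1, T1 = normalize(x1)
    # Row i is the vectorized outer product of point i in image 1 and image 0,
    # i.e., the vector z_i of (11.34)
    Z = np.einsum('ia,ib->iab', n1, n0).reshape(len(x0), 9)
    _, _, Vt = np.linalg.svd(Z)
    E_tilde = Vt[-1].reshape(3, 3)     # singular vector of the smallest singular value
    E = T1.T @ E_tilde @ T0            # undo the normalization, as in (11.37)
    return E / np.linalg.norm(E)
```

The final division only fixes the arbitrary scale of the estimate (unit Frobenius norm); the sign remains ambiguous.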


Seven-point algorithm. Because E is rank-deficient, it turns out that we actually only need
seven correspondences of the form of Equation (11.34) instead of eight to estimate this matrix
(Hartley 1994a; Torr and Murray 1997; Hartley and Zisserman 2004). The advantage of using
fewer correspondences inside a RANSAC robust fitting stage is that fewer random samples
need to be generated. From this set of seven homogeneous equations (which we can stack
into a 7 × 9 matrix for SVD analysis), we can find two independent vectors, say $f_0$ and $f_1$,
such that $z_i \cdot f_j = 0$. These two vectors can be converted back into 3 × 3 matrices $E_0$ and
$E_1$, which span the solution space for

$$E = \alpha E_0 + (1 - \alpha) E_1. \tag{11.38}$$

To find the correct value of $\alpha$, we observe that E has a zero determinant, as it is rank deficient,
and hence

$$|\alpha E_0 + (1 - \alpha) E_1| = 0. \tag{11.39}$$

This gives us a cubic equation in $\alpha$, which has either one or three real solutions (roots).
Substituting these values into (11.38) to obtain E, we can test this essential matrix against other
unused feature correspondences to select the correct one.

¹⁰ More precisely, Hartley (1997a) suggests scaling the points “so that the average distance from the origin is equal
to $\sqrt{2}$” but the heuristic of unit variance is faster to compute (does not require per-point square roots) and should
yield comparable improvements.
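A minimal NumPy sketch of this procedure is given below. The function name `seven_point` is hypothetical, the inputs are assumed to be 7 × 3 arrays of homogeneous ray coordinates, and the cubic (11.39) is recovered numerically by sampling the determinant at four values of $\alpha$ and fitting a polynomial, which is one convenient option and not necessarily how the cited papers proceed.

```python
import numpy as np

def seven_point(h0, h1):
    """Candidate E matrices from 7 homogeneous correspondences (7 x 3 arrays)."""
    Z = np.einsum('ia,ib->iab', h1, h0).reshape(7, 9)
    _, _, Vt = np.linalg.svd(Z)
    E0, E1 = Vt[-1].reshape(3, 3), Vt[-2].reshape(3, 3)   # null-space basis
    # det(a E0 + (1 - a) E1) is a cubic in a: recover it from four samples
    samples = np.array([-1.0, 0.0, 1.0, 2.0])
    dets = [np.linalg.det(a * E0 + (1 - a) * E1) for a in samples]
    coeffs = np.polyfit(samples, dets, 3)
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real          # keep the real roots
    return [a * E0 + (1 - a) * E1 for a in real]
```

Each returned candidate satisfies all seven constraints and has a zero determinant, so the choice among them has to be made with additional correspondences, as described above.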

The normalized “eight-point algorithm” (Hartley 1997a) and seven-point algorithm
described above are not the only way to estimate the camera motion from correspondences.
Additional variants include a five-point algorithm that requires finding the roots of a 10th
degree polynomial (Nistér 2004) as well as variants that handle special (restricted) motions
or scene structures, as discussed later on in this section. Because such algorithms use fewer
points to compute their estimates, they are less sensitive to outliers when used as part of a
random sampling (RANSAC) strategy.¹¹


Recovering t and R. Once an estimate for the essential matrix E has been recovered, the
direction of the translation vector t can be estimated. Note that the absolute distance between
the two cameras can never be recovered from pure image measurements alone, regardless of
how many cameras or points are used. Knowledge about absolute camera and point positions
or distances, often called ground control points in photogrammetry, is always required to
establish the final scale, position, and orientation.

To estimate this direction $\hat{t}$, observe that under ideal noise-free conditions, the essential
matrix E is singular, i.e., $\hat{t}^T E = 0$. This singularity shows up as a singular value of 0 when
an SVD of E is performed,

$$E = [\hat{t}]_\times R = U \Sigma V^T = \begin{bmatrix} u_0 & u_1 & \hat{t} \end{bmatrix} \begin{bmatrix} 1 & & \\ & 1 & \\ & & 0 \end{bmatrix} \begin{bmatrix} v_0^T \\ v_1^T \\ v_2^T \end{bmatrix}. \tag{11.40}$$

When E is computed from noisy measurements, the singular vector associated with the
smallest singular value gives us $\hat{t}$. (The other two singular values should be similar but are not, in
general, equal to 1 because E is only computed up to an unknown scale.)

Once $\hat{t}$ has been recovered, how can we estimate the corresponding rotation matrix R?
Recall that the cross-product operator $[\hat{t}]_\times$ (2.32) projects a vector onto a set of orthogonal
basis vectors that include $\hat{t}$, zeros out the $\hat{t}$ component, and rotates the other two by 90°,

$$[\hat{t}]_\times = S Z R_{90^\circ} S^T = \begin{bmatrix} s_0 & s_1 & \hat{t} \end{bmatrix} \begin{bmatrix} 1 & & \\ & 1 & \\ & & 0 \end{bmatrix} \begin{bmatrix} 0 & -1 & \\ 1 & 0 & \\ & & 1 \end{bmatrix} \begin{bmatrix} s_0^T \\ s_1^T \\ \hat{t}^T \end{bmatrix}, \tag{11.41}$$

where $\hat{t} = s_0 \times s_1$. From Equations (11.40 and 11.41), we get

$$E = [\hat{t}]_\times R = S Z R_{90^\circ} S^T R = U \Sigma V^T, \tag{11.42}$$

from which we can conclude that S = U. Recall that for a noise-free essential matrix,
$\Sigma = Z$, and hence

$$R_{90^\circ} U^T R = V^T \tag{11.43}$$

and

$$R = U R_{90^\circ}^T V^T. \tag{11.44}$$

¹¹ You can find an experimental comparison of a number of RANSAC variants at
https://opencv.org/evaluating-opencvs-new-ransacs/.


Unfortunately, we only know both E and $\hat{t}$ up to a sign. Furthermore, the matrices U and V
are not guaranteed to be rotations (you can flip both their signs and still get a valid SVD). For
this reason, we have to generate all four possible rotation matrices

$$R = \pm U R_{\pm 90^\circ}^T V^T \tag{11.45}$$

and keep the two whose determinant |R| = 1. To disambiguate between the remaining pair
of potential rotations, which form a twisted pair (Hartley and Zisserman 2004, p. 259), we
need to pair them with both possible signs of the translation direction $\pm\hat{t}$ and select the
combination for which the largest number of points is seen in front of both cameras.¹²

The property that points must lie in front of the camera, i.e., at a positive distance along
the viewing rays emanating from the camera, is known as cheirality (Hartley 1998). In
addition to determining the signs of the rotation and translation, as described above, the cheirality
(sign of the distances) of the points in a reconstruction can be used inside a RANSAC
procedure (along with the reprojection errors) to distinguish between likely and unlikely
configurations.¹³ Cheirality can also be used to transform projective reconstructions (Sections 11.3.3
and 11.3.4) into quasi-affine reconstructions (Hartley 1998).
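The decomposition and cheirality test can be sketched together in NumPy. The function name `decompose_essential` is hypothetical, the inputs `x0` and `x1` are assumed to be N × 3 arrays of ray directions in the two cameras, and the depths are obtained by solving $d_1 \hat{x}_1 = d_0 R \hat{x}_0 + \hat{t}$ in a least squares sense for each point, which is one simple way to evaluate cheirality and not the only one.

```python
import numpy as np

def decompose_essential(E, x0, x1):
    """Pick (R, t) from E using cheirality; x0, x1 are N x 3 ray directions."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:       # flip signs so that U and V are rotations
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    R90 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    t_hat = U[:, 2]                # left singular vector for the zero singular value
    best, best_count = None, -1
    for R in (U @ R90.T @ Vt, U @ R90 @ Vt):      # the twisted pair
        for t in (t_hat, -t_hat):                 # both translation signs
            count = 0
            for a, b in zip(x0, x1):
                # Solve d1 * b = d0 * (R a) + t for the two depths
                A = np.column_stack([R @ a, -b])
                d0, d1 = np.linalg.lstsq(A, -t, rcond=None)[0]
                count += int(d0 > 0 and d1 > 0)
            if count > best_count:
                best, best_count = (R, t), count
    return best
```

Making U and V proper rotations up front means that only the two candidates with determinant 1 are ever formed, so the loop runs over the four remaining (R, t) combinations and keeps the one with the most points in front of both cameras.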


11.3.2 Special motions and structures 


In certain situations, specially tailored algorithms can take advantage of known (or guessed) 


camera arrangements or 3D structures. 


¹² In the noise-free case, a single point suffices. It is safer, however, to test all or a sufficient subset of points,
downweighting the ones that lie close to the plane at infinity, for which it is easy to get depth reversals.
¹³ Note that as points get further away from a camera, i.e., closer toward the plane at infinity, errors in cheirality
become more likely.
Figure 11.12 Pure translational camera motion results in visual motion where all the
points move towards (or away from) a common focus of expansion (FOE) $\mathbf{e}$. They therefore
satisfy the triple product condition $(\mathbf{x}_0, \mathbf{x}_1, \mathbf{e}) = \mathbf{e} \cdot (\mathbf{x}_0 \times \mathbf{x}_1) = 0$.


Pure translation (known rotation). In the case where we know the rotation, we can pre-rotate
the points in the second image to match the viewing direction of the first. The resulting
set of 3D points all move towards (or away from) the focus of expansion (FOE), as shown in
Figure 11.12.¹⁴ The resulting essential matrix $E$ is (in the noise-free case) skew symmetric
and so can be estimated more directly by setting $e_{ij} = -e_{ji}$ and $e_{ii} = 0$ in (11.33). Two
points with non-zero parallax now suffice to estimate the FOE.

A more direct derivation of the FOE estimate can be obtained by minimizing the triple
product

$$\sum_i (\mathbf{x}_{i0}, \mathbf{x}_{i1}, \mathbf{e})^2 = \sum_i ((\mathbf{x}_{i0} \times \mathbf{x}_{i1}) \cdot \mathbf{e})^2, \qquad (11.46)$$

which is equivalent to finding the null space for the set of equations

$$(y_{i0} - y_{i1}) e_0 + (x_{i1} - x_{i0}) e_1 + (x_{i0} y_{i1} - y_{i0} x_{i1}) e_2 = 0. \qquad (11.47)$$

Note that, as in the eight-point algorithm, it is advisable to normalize the 2D points to have
unit variance before computing this estimate.
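As a rough illustration of (11.46) and (11.47), the FOE can be estimated as the null vector of the stacked cross products $\mathbf{x}_{i0} \times \mathbf{x}_{i1}$. The sketch below assumes the rotation has already been compensated and that the FOE is a finite image point. The function name `estimate_foe` and the particular isotropic normalization are illustrative choices, not prescribed by the text.

```python
import numpy as np

def estimate_foe(x0, x1):
    """Estimate the focus of expansion from (N, 2) arrays of matched points."""
    # Shared similarity transform T that roughly centers and rescales the points.
    pts = np.vstack([x0, x1])
    mean, scale = pts.mean(axis=0), pts.std()
    T = np.array([[1.0 / scale, 0.0, -mean[0] / scale],
                  [0.0, 1.0 / scale, -mean[1] / scale],
                  [0.0, 0.0, 1.0]])
    h0 = np.column_stack([x0, np.ones(len(x0))]) @ T.T
    h1 = np.column_stack([x1, np.ones(len(x1))]) @ T.T
    # Each row is x_i0 x x_i1, i.e., the coefficients of (11.47).
    A = np.cross(h0, h1)
    _, _, Vt = np.linalg.svd(A)
    e_normalized = Vt[-1]                   # minimizes the sum in (11.46)
    e = np.linalg.solve(T, e_normalized)    # undo the normalization
    return e[:2] / e[2]                     # assumes a finite FOE (e[2] != 0)

# Synthetic check: every point moves radially away from a known FOE.
rng = np.random.default_rng(1)
foe_true = np.array([0.3, -0.2])
x0 = rng.uniform(-1, 1, (30, 2))
x1 = x0 + rng.uniform(0.05, 0.2, (30, 1)) * (x0 - foe_true)
foe_est = estimate_foe(x0, x1)
```

Because the same transform is applied to both images, the null vector in normalized coordinates is simply the transformed FOE, which is why a single linear solve suffices to map it back.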

In situations where a large number of points at infinity are available, e.g., when shooting 
outdoor scenes or when the camera motion is small compared to distant objects, this suggests 
an alternative RANSAC strategy for estimating the camera motion. First, pick a pair of 
points to estimate a rotation, hoping that both of the points lie at infinity (very far from the 
camera). Then, compute the FOE and check whether the residual error is small (indicating 
agreement with this rotation hypothesis) and whether the motions towards or away from the 
epipole (FOE) are all in the same direction (ignoring very small motions, which may be
noise-contaminated).


¹⁴ Fans of Star Trek and Star Wars will recognize this as the “jump to hyperdrive” visual effect.
Pure rotation. The case of pure rotation results in a degenerate estimate of the essential
matrix $E$ and of the translation direction $t$. Consider first the case of the rotation matrix
being known. The estimates for the FOE will be degenerate, because $\mathbf{x}_{i0} \sim \mathbf{x}_{i1}$, and hence
(11.47) is degenerate. A similar argument shows that the equations for the essential matrix
(11.33) are also rank-deficient.

This suggests that it might be prudent before computing a full essential matrix to first
compute a rotation estimate $R$ using (8.32), potentially with just a small number of points,
and then compute the residuals after rotating the points before proceeding with a full $E$
computation.


Dominant planar structure. When a dominant plane is present in the scene, DEGENSAC, 
which tests whether too many correspondences are co-planar, can be used to recover the 
fundamental matrix more reliably than the seven-point algorithm (Chum, Werner, and Matas 
2005). 

As you can tell from the previous special cases, there exist many different specialized 
cases of two-frame structure-from-motion as well as many alternative appropriate techniques. 
The OpenGV library developed by Kneip and Furgale (2014) contains open-source
implementations of many of these algorithms.¹⁵


11.3.3 Projective (uncalibrated) reconstruction 


In many cases, such as when trying to build a 3D model from internet or legacy photos taken
by unknown cameras without any EXIF tags, we do not know ahead of time the intrinsic
calibration parameters associated with the input images. In such situations, we can still
estimate a two-frame reconstruction, although the true metric structure may not be available, e.g.,
orthogonal lines or planes in the world may not end up being reconstructed as orthogonal.

Consider the derivations we used to estimate the essential matrix $E$ (11.30-11.32). In the
uncalibrated case, we do not know the calibration matrices $K_j$, so we cannot use the
normalized ray directions $\hat{x}_j = K_j^{-1} x_j$. Instead, we have access only to the image coordinates $x_j$,
and so the essential matrix equation (11.30) becomes

$$\hat{x}_1^T E \hat{x}_0 = x_1^T K_1^{-T} E K_0^{-1} x_0 = x_1^T F x_0 = 0, \qquad (11.48)$$

where

$$F = K_1^{-T} E K_0^{-1} \qquad (11.49)$$

is called the fundamental matrix (Faugeras 1992; Hartley, Gupta, and Chang 1992; Hartley
and Zisserman 2004).

¹⁵ https://laurentkneip.github.io/opengv
Like the essential matrix, the fundamental matrix is (in principle) rank two,

$$F = U \Sigma V^T = \begin{bmatrix} u_0 & u_1 & e_1 \end{bmatrix} \begin{bmatrix} \sigma_0 & & \\ & \sigma_1 & \\ & & 0 \end{bmatrix} \begin{bmatrix} v_0^T \\ v_1^T \\ e_0^T \end{bmatrix}. \qquad (11.50)$$

Its smallest left singular vector indicates the epipole $e_1$ in image 1 and its smallest right
singular vector is $e_0$ (Figure 11.11). The fundamental matrix can be factored into a
skew-symmetric cross product matrix $[e]_\times$ and a homography $H$,

$$F = [e]_\times H. \qquad (11.51)$$

The homography $H$, which in principle from (11.49) should equal

$$H = K_1^{-T} R K_0^{-1}, \qquad (11.52)$$

cannot be uniquely recovered from $F$, as any homography of the form $H' = H + e v^T$ results
in the same $F$ matrix. (Note that $[e]_\times$ annihilates any multiple of $e$.)

Any one of these valid homographies $H$ maps some plane in the scene from one image
to the other. It is not possible to tell in advance which one it is without either selecting four
or more co-planar correspondences to compute $H$ as part of the $F$ estimation process (in a
manner analogous to guessing a rotation for $E$) or mapping all points in one image through $H$
and seeing which ones line up with their corresponding locations in the other. The resulting
representation is often referred to as plane plus parallax (Kumar, Anandan, and Hanna 1994;
Sawhney 1994) and is described in more detail in Section 2.1.4.

To create a projective reconstruction of the scene, we can pick any valid homography
$H$ that satisfies Equation (11.49). For example, following a technique analogous to
Equations (11.40-11.44), we get

$$F = [e]_\times H = S Z R_{90^\circ} S^T H = U \Sigma V^T \qquad (11.53)$$

and hence

$$H = U R_{90^\circ}^T \hat{\Sigma} V^T, \qquad (11.54)$$

where $\hat{\Sigma}$ is the singular value matrix with the smallest value replaced by a reasonable
alternative (say, the middle value).¹⁶ We can then form a pair of camera matrices

$$P_0 = [I \mid 0] \quad \text{and} \quad P_1 = [H \mid e], \qquad (11.55)$$

from which a projective reconstruction of the scene can be computed using triangulation
(Section 11.2.4).

¹⁶ Hartley and Zisserman (2004, p. 256) recommend using $H = [e]_\times F$ (Luong and Viéville 1996), which places
the camera on the plane at infinity.

While the projective reconstruction may not be useful on its own, it can often be upgraded
to an affine or metric reconstruction, as described below. Even without this step, however,
the fundamental matrix $F$ can be very useful in finding additional correspondences, as they
must all lie on corresponding epipolar lines, i.e., any feature $x_0$ in image 0 must have its
correspondence lying on the associated epipolar line $l_1 = F x_0$ in image 1, assuming that the
point motions are due to a rigid transformation.
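The use of $F$ to constrain correspondences can be illustrated with a short NumPy sketch. The helper names `epipoles` and `epipolar_distance` are hypothetical, and the synthetic intrinsics and pose used in the check are arbitrary illustrative values. The convention $x_1^T F x_0 = 0$ from (11.48) is assumed throughout.

```python
import numpy as np

def epipoles(F):
    """Return (e0, e1): the right and left null vectors of F (unit norm)."""
    U, _, Vt = np.linalg.svd(F)
    return Vt[2], U[:, 2]

def epipolar_distance(F, x0, x1):
    """Distance (in pixels) from x1 to the epipolar line l1 = F x0 in image 1."""
    l1 = F @ np.append(x0, 1.0)
    return abs(l1 @ np.append(x1, 1.0)) / np.hypot(l1[0], l1[1])

# Synthetic check with made-up intrinsics and motion.
rng = np.random.default_rng(4)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([1.0, 0.1, 0.05])
t_cross = np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])
K_inv = np.linalg.inv(K)
F = K_inv.T @ t_cross @ R @ K_inv            # (11.49) with K0 = K1 = K

P0 = np.column_stack([rng.uniform(-1, 1, 15), rng.uniform(-1, 1, 15),
                      rng.uniform(4, 8, 15)])
P1 = P0 @ R.T + t
x0 = (P0 @ K.T)[:, :2] / (P0 @ K.T)[:, 2:3]
x1 = (P1 @ K.T)[:, :2] / (P1 @ K.T)[:, 2:3]
distances = np.array([epipolar_distance(F, a0, a1) for a0, a1 in zip(x0, x1)])
e0, e1 = epipoles(F)
```

In a matching pipeline, a candidate match would typically be accepted only if this distance falls below a threshold of a few pixels. The threshold itself is application dependent.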


11.3.4 Self-calibration 


The results of structure from motion computation are much more useful if a metric recon- 
struction is obtained, i.e., one in which parallel lines are parallel, orthogonal walls are at right 
angles, and the reconstructed model is a scaled version of reality. Over the years, a large num- 
ber of self-calibration (or auto-calibration) techniques have been developed for converting a 
projective reconstruction into a metric one, which is equivalent to recovering the unknown 
calibration matrices $K_j$ associated with each image (Hartley and Zisserman 2004; Moons,
Van Gool, and Vergauwen 2010). 

In situations where additional information is known about the scene, different methods 
may be employed. For example, if there are parallel lines in the scene, three or more vanishing 
points, which are the images of points at infinity, can be used to establish the homography for 
the plane at infinity, from which focal lengths and rotations can be recovered. If two or more 
finite orthogonal vanishing points have been observed, the single-image calibration method 
based on vanishing points (Section 11.1.1) can be used instead. 

In the absence of such external information, it is not possible to recover a fully parameterized
independent calibration matrix $K_j$ for each image from correspondences alone. To see
this, consider the set of all camera matrices $P_j = K_j [R_j | t_j]$ projecting world coordinates
$p_i = (X_i, Y_i, Z_i, W_i)$ into screen coordinates $x_{ij} \sim P_j p_i$. Now consider transforming the
3D scene $\{p_i\}$ through an arbitrary $4 \times 4$ projective transformation $H$, yielding a new model
consisting of points $p_i' = H p_i$. Post-multiplying each $P_j$ matrix by $H^{-1}$ still produces the
same screen coordinates and a new set of calibration matrices can be computed by applying RQ
decomposition to the new camera matrix $P_j' = P_j H^{-1}$.
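The ambiguity argument can be checked numerically. In the sketch below, the helper `rq3` is a hypothetical RQ decomposition built from NumPy's QR routine, and the camera, the points, and the projective transformation are arbitrary synthetic values.

```python
import numpy as np

def rq3(M):
    """RQ decomposition M = K @ Q with K upper triangular (positive diagonal)."""
    Qt, Rt = np.linalg.qr(np.flipud(M).T)
    K = np.flipud(Rt.T)[:, ::-1]
    Q = Qt.T[::-1, :]
    D = np.diag(np.sign(np.diag(K)))   # flip signs so that diag(K) > 0
    return K @ D, D @ Q

rng = np.random.default_rng(3)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
P = K @ np.column_stack([np.eye(3), [0.5, 0.0, 0.0]])   # P = K [R | t]
H = np.eye(4) + 0.1 * rng.normal(size=(4, 4))           # arbitrary 4x4 transform
pts = np.vstack([rng.uniform(-1, 1, (2, 10)),
                 rng.uniform(4, 8, (1, 10)),
                 np.ones((1, 10))])                      # homogeneous 3D points

P_new = P @ np.linalg.inv(H)     # transformed camera
pts_new = H @ pts                # transformed scene
x = P @ pts
x_new = P_new @ pts_new

# The "calibration" extracted from the transformed camera is generally different.
K_new, Q_new = rq3(P_new[:, :3])
K_new_normalized = K_new / K_new[2, 2]
```

The projections `x` and `x_new` agree up to scale, while `K_new_normalized` is in general not equal to `K`, which is the reason some restriction on the form of the calibration matrix is needed.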

For this reason, all self-calibration methods assume some restricted form of the calibration 
matrix, either by setting or equating some of their elements or by assuming that they do not 
vary over time. While most of the techniques discussed by Hartley and Zisserman (2004) and
Moons, Van Gool, and Vergauwen (2010) require three or more frames, in this section we
present a simple technique that can recover the focal lengths $(f_0, f_1)$ of both images from the
fundamental matrix $F$ in a two-frame reconstruction (Hartley and Zisserman 2004, p. 472).

To accomplish this, we assume that the camera has zero skew, a known aspect ratio (usu- 
ally set to 1), and a known image center, as in Equation (2.59). How reasonable is this 
assumption in practice? The answer, as with many questions, is “it depends”. 

If absolute metric accuracy is required, as in photogrammetry applications, it is imperative
to pre-calibrate the cameras using one of the techniques from Section 11.1 and to use ground 
control points to pin down the reconstruction. If instead, we simply wish to reconstruct the 
world for visualization or image-based rendering applications, as in the Photo Tourism system 
of Snavely, Seitz, and Szeliski (2006), this assumption is quite reasonable in practice. 

Most cameras today have square pixels and an image center near the middle of the image, 
and are much more likely to deviate from a simple camera model due to radial distortion 
(Section 11.1.4), which should be compensated for whenever possible. The biggest problems 
occur when images have been cropped off-center, in which case the image center will no 
longer be in the middle, or when perspective pictures have been taken of a different picture, 
in which case a general camera matrix becomes necessary.¹⁷

Given these caveats, the two-frame focal length estimation algorithm based on the Kruppa
equations developed by Hartley and Zisserman (2004, p. 456) proceeds as follows. Take the
left and right singular vectors $\{u_0, u_1, v_0, v_1\}$ of the fundamental matrix $F$ (11.50) and their
associated singular values $\{\sigma_0, \sigma_1\}$ and form the following set of equations:

$$\frac{u_1^T D_0 u_1}{\sigma_0^2 v_0^T D_1 v_0} = -\frac{u_0^T D_0 u_1}{\sigma_0 \sigma_1 v_0^T D_1 v_1} = \frac{u_0^T D_0 u_0}{\sigma_1^2 v_1^T D_1 v_1}, \qquad (11.56)$$

where the two matrices

$$D_j = K_j K_j^T = \operatorname{diag}(f_j^2, f_j^2, 1) = \begin{bmatrix} f_j^2 & & \\ & f_j^2 & \\ & & 1 \end{bmatrix} \qquad (11.57)$$

encode the unknown focal lengths. For simplicity, let us rewrite each of the numerators and
denominators in (11.56) as

$$e_{ij0}(f_0^2) = u_i^T D_0 u_j = a_{ij} + b_{ij} f_0^2, \qquad (11.58)$$

$$e_{ij1}(f_1^2) = \sigma_i \sigma_j v_i^T D_1 v_j = c_{ij} + d_{ij} f_1^2. \qquad (11.59)$$

Notice that each of these is affine (linear plus constant) in either $f_0^2$ or $f_1^2$. Hence, we can
cross-multiply these equations to obtain quadratic equations in $f_j^2$, which can readily be
solved. (See also the work by Bougnoux (1998) and Kanatani and Matsunaga (2000) for
some alternative formulations.)

¹⁷ In Photo Tourism, our system registered photographs of an information sign outside Notre Dame with real
pictures of the cathedral.

An alternative solution technique is to observe that we have a set of three equations related
by an unknown scalar $\lambda$, i.e.,

$$e_{ij0}(f_0^2) = \lambda e_{ij1}(f_1^2) \qquad (11.60)$$

(Richard Hartley, personal communication, July 2009). These can readily be solved to yield
$(f_0^2, \lambda f_1^2, \lambda)$ and hence $(f_0, f_1)$.

How well does this approach work in practice? There are certain degenerate configura- 
tions, such as when there is no rotation or when the optical axes intersect, when it does not 
work at all. (In such a situation, you can vary the focal lengths of the cameras and obtain 
a deeper or shallower reconstruction, which is an example of a bas-relief ambiguity (Sec- 
tion 11.4.5).) Hartley and Zisserman (2004) recommend using techniques based on three 
or more frames. However, if you find two images for which the estimates of $(f_0^2, \lambda f_1^2, \lambda)$
are well conditioned, they can be used to initialize a more complete bundle adjustment of 
all the parameters (Section 11.4.2). An alternative, which is often used in systems such as 
Photo Tourism, is to use camera EXIF tags or generic default values to initialize focal length 
estimates and refine them as part of bundle adjustment. 


11.3.5 Application: View morphing 


An interesting application of basic two-frame structure from motion is view morphing (also 
known as view interpolation, see Section 14.1), which can be used to generate a smooth 3D 
animation from one view of a 3D scene to another (Chen and Williams 1993; Seitz and Dyer 
1996). 

To create such a transition, you must first smoothly interpolate the camera matrices, i.e., 
the camera positions, orientations, and focal lengths. While simple linear interpolation can be 
used (representing rotations as quaternions (Section 2.1.3)), a more pleasing effect is obtained 
by easing in and easing out the camera parameters, e.g., using a raised cosine, as well as 
moving the camera along a more circular trajectory (Snavely, Seitz, and Szeliski 2006). 
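A minimal sketch of such a camera interpolation, using only the Python standard library, is given below. It assumes quaternions stored as `(w, x, y, z)` tuples and a hypothetical camera representation as a dictionary with keys `q` (orientation), `c` (position), and `f` (focal length). The circular trajectory mentioned above is not implemented here; positions are interpolated linearly.

```python
import math

def ease(u):
    """Raised-cosine ease-in/ease-out: maps [0, 1] to [0, 1] with zero end slopes."""
    return 0.5 * (1.0 - math.cos(math.pi * u))

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                       # take the shorter arc
        q1, dot = [-b for b in q1], -dot
    if dot > 0.9995:                    # nearly identical: fall back to lerp
        q = [a + u * (b - a) for a, b in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        w0 = math.sin((1.0 - u) * theta) / math.sin(theta)
        w1 = math.sin(u * theta) / math.sin(theta)
        q = [w0 * a + w1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(a * a for a in q))
    return [a / norm for a in q]

def interpolate_camera(cam0, cam1, u):
    """Blend two cameras at time u in [0, 1] using the eased parameter."""
    s = ease(u)
    return {
        "q": slerp(cam0["q"], cam1["q"], s),
        "c": [a + s * (b - a) for a, b in zip(cam0["c"], cam1["c"])],
        "f": cam0["f"] + s * (cam1["f"] - cam0["f"]),
    }

# Example: identity orientation to a 90 degree rotation about the z axis.
cam_a = {"q": [1.0, 0.0, 0.0, 0.0], "c": [0.0, 0.0, 0.0], "f": 500.0}
cam_b = {"q": [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)],
         "c": [2.0, 0.0, 0.0], "f": 700.0}
cam_mid = interpolate_camera(cam_a, cam_b, 0.5)
```

Because `ease` has zero slope at both ends, the camera starts and stops smoothly, whereas a plain linear parameter would produce abrupt velocity changes at the endpoints.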

To generate in-between frames, either a full set of 3D correspondences needs to be
established (Section 12.3) or 3D models (proxies) must be created for each reference view.
Section 14.1 describes several widely used approaches to this problem. One of the simplest
is to just triangulate the set of matched feature points in each image, e.g., using Delaunay
triangulation. As the 3D points are re-projected into their intermediate views, pixels can be
mapped from their original source images to their new views using affine or projective
mapping (Szeliski and Shum 1997). The final image is then composited using a linear blend of
the two reference images, as with usual morphing (Section 3.6.3).

Figure 11.13 3D reconstruction of a rotating ping pong ball using factorization (Tomasi
and Kanade 1992) © 1992 Springer: (a) sample image with tracked features overlaid; (b)
subsampled feature motion stream; (c) two views of the reconstructed 3D model.


11.4 Multi-frame structure from motion 


While two-frame techniques are useful for reconstructing sparse geometry from stereo image 
pairs and for initializing larger-scale 3D reconstructions, most applications can benefit from 
the much larger number of images that are usually available in photo collections and videos 
of scenes. 

In this section, we briefly review an older technique called factorization, which can
provide useful solutions for short video sequences, and then turn to the more commonly used
bundle adjustment approach, which uses non-linear least squares to obtain optimal solutions
under general camera configurations.


11.4.1 Factorization 


When processing video sequences, we often get extended feature tracks (Section 7.1.5) from 
which it is possible to recover the structure and motion using a process called factorization. 
Consider the tracks generated by a rotating ping pong ball, which has been marked with dots 
to make its shape and motion more discernable (Figure 11.13). We can readily see from 
the shape of the tracks that the moving object must be a sphere, but how can we infer this 
mathematically? 

It turns out that, under orthography or related models we discuss below, the shape and
motion can be recovered simultaneously using a singular value decomposition (Tomasi and
Kanade 1992). The details of how to do this are presented in the paper by Tomasi and Kanade
(1992) and also in the first edition of this book (Szeliski 2010, Section 7.3).
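To give a flavor of the idea, the following sketch shows only the rank-3 (affine) factorization step on synthetic orthographic data. It is not the full algorithm of Tomasi and Kanade: the metric upgrade that turns the affine motion matrices into true rotations is omitted, and the function name `affine_factorization` and the test data are illustrative assumptions.

```python
import numpy as np

def affine_factorization(W):
    """Factor a (2F, N) measurement matrix into motion (2F, 3) and shape (3, N)."""
    centroid = W.mean(axis=1, keepdims=True)       # per-row track centroid
    U, s, Vt = np.linalg.svd(W - centroid, full_matrices=False)
    root = np.sqrt(s[:3])
    motion = U[:, :3] * root                       # defined up to a 3x3 transform
    shape = root[:, None] * Vt[:3]
    return motion, shape, centroid, s

# Synthetic data: points on a unit sphere seen orthographically while rotating.
rng = np.random.default_rng(5)
X = rng.normal(size=(3, 40))
X /= np.linalg.norm(X, axis=0)
rows = []
for angle in np.linspace(0.0, 1.0, 6):
    R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(angle), 0.0, np.cos(angle)]])
    rows.append((R @ X)[:2])                       # orthography: drop the depth
W = np.vstack(rows)

motion, shape, centroid, singular_values = affine_factorization(W)
W_reconstructed = motion @ shape + centroid
```

For noise-free orthographic data, the centered measurement matrix has rank three, so the fourth and later singular values vanish. With noise, truncating to the three largest singular values gives the best rank-3 approximation in the least squares sense.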

Once the rotation matrices and 3D point locations have been recovered, there still exists 
a bas-relief ambiguity, i.e., we can never be sure if the object is rotating left to right or if 
its depth reversed version is moving the other way. (This can be seen in the classic rotating 
Necker Cube visual illusion.) Additional cues, such as the appearance and disappearance of 
points, or perspective effects, both of which are discussed below, can be used to remove this 
ambiguity. 

For motion models other than pure orthography, e.g., for scaled orthography or para- 
perspective, the approach above must be extended in the appropriate manner. Such tech- 
niques are relatively straightforward to derive from first principles; more details can be found 
in papers that extend the basic factorization approach to these more flexible models (Poel- 
man and Kanade 1997). Additional extensions of the original factorization algorithm include 
multi-body rigid motion (Costeira and Kanade 1995), sequential updates to the factorization 
(Morita and Kanade 1997), the addition of lines and planes (Morris and Kanade 1998), and 
re-scaling the measurements to incorporate individual location uncertainties (Anandan and 
Irani 2002).

A disadvantage of factorization approaches is that they require a complete set of tracks, 
i.e., each point must be visible in each frame, for the factorization approach to work. Tomasi
and Kanade (1992) deal with this problem by first applying factorization to smaller denser 
subsets and then using known camera (motion) or point (structure) estimates to hallucinate 
additional missing values, which allows them to incrementally incorporate more features and 
cameras. Huynh, Hartley, and Heyden (2003) extend this approach to view missing data as 
special cases of outliers. Buchanan and Fitzgibbon (2005) develop fast iterative algorithms 
for performing large matrix factorizations with missing data. The general topic of principal 
component analysis (PCA) with missing data also appears in other computer vision problems 
(Shum, Ikeuchi, and Reddy 1995; De la Torre and Black 2003; Gross, Matthews, and Baker 
2006; Torresani, Hertzmann, and Bregler 2008; Vidal, Ma, and Sastry 2016). 


Perspective and projective factorization 


Another disadvantage of regular factorization is that it cannot deal with perspective cameras. 
One way to get around this problem is to perform an initial affine (e.g., orthographic) recon- 
struction and to then correct for the perspective effects in an iterative manner (Christy and 
Horaud 1996). This algorithm usually converges in three to five iterations, with the majority 
of the time spent in the SVD computation. 


An alternative approach, which does not assume partially calibrated cameras (known
image center, square pixels, and zero skew) is to perform a fully projective factorization (Sturm
and Triggs 1996; Triggs 1996). In this case, the inclusion of the third row of the camera
matrix in the measurement matrix is equivalent to multiplying each reconstructed measurement
$x_{ji} = M_j p_i$ by its inverse (projective) depth $\eta_{ji} = d_{ji}^{-1} = 1/(P_{j2} p_i)$ or, equivalently,
multiplying each measured position by its projective depth $d_{ji}$. In the original paper by Sturm and
Triggs (1996), the projective depths $d_{ji}$ are obtained from two-frame reconstructions, while
in later work (Triggs 1996; Oliensis and Hartley 2007), they are initialized to $d_{ji} = 1$ and
updated after each iteration. Oliensis and Hartley (2007) present an update formula that is
guaranteed to converge to a fixed point. None of these authors suggest actually estimating the
third row of $P_j$ as part of the projective depth computations. In any case, it is unclear when a
fully projective reconstruction would be preferable to a partially calibrated one, especially if
they are being used to initialize a full bundle adjustment of all the parameters.

One of the attractions of factorization methods is that they provide a “closed form” (some- 
times called a “linear”) method to initialize iterative techniques such as bundle adjustment. 
An alternative initialization technique is to estimate the homographies corresponding to some 
common plane seen by all the cameras (Rother and Carlsson 2002). In a calibrated camera 
setting, this can correspond to estimating consistent rotations for all of the cameras, for ex- 
ample, using matched vanishing points (Antone and Teller 2002). Once these have been 
recovered, the camera positions can then be obtained by solving a linear system (Antone and 
Teller 2002; Rother and Carlsson 2002; Rother 2003). 


11.4.2 Bundle adjustment 


As we have mentioned several times before, the most accurate way to recover structure and
motion is to perform robust non-linear minimization of the measurement (re-projection)
errors, which is commonly known in the photogrammetry (and now computer vision)
communities as bundle adjustment.¹⁸ Triggs, McLauchlan et al. (1999) provide an excellent overview
of this topic, including its historical development, pointers to the photogrammetry literature
(Slama 1980; Atkinson 1996; Kraus 1997), and subtle issues with gauge ambiguities. The
topic is also treated in depth in textbooks and surveys on multi-view geometry (Faugeras and
Luong 2001; Hartley and Zisserman 2004; Moons, Van Gool, and Vergauwen 2010).

¹⁸ The term “bundle” refers to the bundles of rays connecting camera centers to 3D points and the term
“adjustment” refers to the iterative minimization of re-projection error. Alternative terms for this in the vision community
include optimal motion estimation (Weng, Ahuja, and Huang 1993) and non-linear least squares (Appendix A.3)
(Taylor, Kriegman, and Anandan 1991; Szeliski and Kang 1994).

Figure 11.14 A set of chained transforms for projecting a 3D point $p_i$ into a 2D
measurement $x_{ij}$ through a series of transformations $f^{(k)}$, each of which is controlled by its own
set of parameters. The dashed lines indicate the flow of information as partial derivatives
are computed during a backward pass. The formula for the radial distortion function is
$f_{RD}(x) = (1 + \kappa_1 r^2 + \kappa_2 r^4) x$.

We have already introduced the elements of bundle adjustment in our discussion on
iterative pose estimation (Section 11.2.2), i.e., Equations (11.14-11.20) and Figure 11.7. The
biggest difference between these formulas and full bundle adjustment is that our feature
location measurements $x_{ij}$ now depend not only on the point (track) index $i$ but also on the
camera pose index $j$,

$$x_{ij} = f(p_i, R_j, c_j, K_j), \qquad (11.61)$$

and that the 3D point positions $p_i$ are also being simultaneously updated. In addition, it is
common to add a stage for radial distortion parameter estimation (2.78),

$$f_{RD}(x) = (1 + \kappa_1 r^2 + \kappa_2 r^4) x, \qquad (11.62)$$

if the cameras being used have not been pre-calibrated, as shown in Figure 11.14.
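The forward pass of such a chain can be sketched in a few lines of NumPy. The ordering assumed here (translate, rotate, perspective division, radial distortion, then calibration matrix) is one plausible reading of the chain, and the function names `radial_distortion` and `project` are hypothetical.

```python
import numpy as np

def radial_distortion(y, k1, k2):
    """Apply f_RD(x) = (1 + k1 r^2 + k2 r^4) x to normalized image coordinates."""
    r2 = y[0] ** 2 + y[1] ** 2
    return (1.0 + k1 * r2 + k2 * r2 ** 2) * y

def project(p, R, c, K, k1=0.0, k2=0.0):
    """Project a 3D world point p into pixel coordinates for one camera."""
    q = R @ (p - c)                        # world to camera coordinates
    y = q[:2] / q[2]                       # perspective division
    y = radial_distortion(y, k1, k2)       # optional lens distortion stage
    return K[:2, :2] @ y + K[:2, 2]        # apply the calibration matrix

K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]])
p = np.array([1.0, 2.0, 10.0])
x_ideal = project(p, np.eye(3), np.zeros(3), K)
x_distorted = project(p, np.eye(3), np.zeros(3), K, k1=0.1)
```

Writing the projection as a composition of small stages is what makes the Jacobians easy to assemble: each stage contributes one factor to the chain rule during the backward pass.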

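While most of the boxes (transforms) in Figure 11.14 have previously been explained
(11.19), the leftmost box has not. This box performs a robust comparison of the predicted
and measured 2D locations $\hat{x}_{ij}$ and $\tilde{x}_{ij}$ after re-scaling by the measurement noise covariance
$\Sigma_{ij}$. In more detail, this operation can be written as

$$r_{ij} = \tilde{x}_{ij} - \hat{x}_{ij}, \qquad (11.63)$$

$$s_{ij}^2 = r_{ij}^T \Sigma_{ij}^{-1} r_{ij}, \qquad (11.64)$$

$$e_{ij} = \hat{\rho}(s_{ij}^2), \qquad (11.65)$$

where $\hat{\rho}(r^2) = \rho(r)$. The corresponding Jacobians (partial derivatives) can be written as

$$\frac{\partial e_{ij}}{\partial s_{ij}^2} = \hat{\rho}'(s_{ij}^2), \qquad (11.66)$$

$$\frac{\partial s_{ij}^2}{\partial \tilde{x}_{ij}} = \Sigma_{ij}^{-1} r_{ij}. \qquad (11.67)$$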
Figure 11.15 A camera rig and its associated transform chain. (a) As the mobile rig (robot)
moves around in the world, its pose with respect to the world at time $t$ is captured by $(R_t^r, c_t^r)$.
Each camera’s pose with respect to the rig is captured by $(R_j^c, c_j^c)$. (b) A 3D point with world
coordinates $p_i^w$ is first transformed into rig coordinates $p_i^r$, and then through the rest of the
camera-specific chain, as shown in Figure 11.14.


The advantage of the chained representation introduced above is that it not only makes 
the computations of the partial derivatives and Jacobians simpler but it can also be adapted 
to any camera configuration. Consider for example a pair of cameras mounted on a robot 
that is moving around in the world, as shown in Figure 11.15a. By replacing the rightmost 
two transformations in Figure 11.14 with the transformations shown in Figure 11.15b, we 
can simultaneously recover the position of the robot at each time and the calibration of each 
camera with respect to the rig, in addition to the 3D structure of the world. 


11.4.3 Exploiting sparsity 


Large bundle adjustment problems, such as those involving reconstructing 3D scenes from
thousands of internet photographs (Snavely, Seitz, and Szeliski 2008b; Agarwal, Furukawa
et al. 2010, 2011; Snavely, Simon et al. 2010), can require solving non-linear least squares
problems with millions of measurements (feature matches) and tens of thousands of unknown
parameters (3D point positions and camera poses). Unless some care is taken, these kinds of
problems can become intractable, because the (direct) solution of dense least squares problems
is cubic in the number of unknowns.

Figure 11.16 (a) Bipartite graph for a toy structure from motion problem and (b) its
associated Jacobian $J$ and (c) Hessian $A$. Numbers indicate 3D points and letters indicate
cameras. The dashed arcs and light blue squares indicate the fill-in that occurs when the
structure (point) variables are eliminated.


Fortunately, structure from motion is a bipartite problem in structure and motion. Each
feature point $x_{ij}$ in a given image depends on one 3D point position $p_i$ and one 3D camera
pose $(R_j, c_j)$. This is illustrated in Figure 11.16a, where each circle (1-9) indicates a 3D
point, each square (A-D) indicates a camera, and lines (edges) indicate which points are
visible in which cameras (2D features). If the values for all the points are known or fixed, the
equations for all the cameras become independent, and vice versa.


If we order the structure variables before the motion variables in the Hessian matrix $A$
(and hence also the right-hand side vector $b$), we obtain a structure for the Hessian shown
in Figure 11.16c.¹⁹ When such a system is solved using sparse Cholesky factorization (see
Appendix A.4) (Björck 1996; Golub and Van Loan 1996), the fill-in occurs in the smaller
motion Hessian $A_{CC}$ (Szeliski and Kang 1994; Triggs, McLauchlan et al. 1999; Hartley and
Zisserman 2004; Lourakis and Argyros 2009; Engels, Stewénius, and Nistér 2006). More
recent papers (Byröd and Åström 2009; Jeong, Nistér et al. 2010; Agarwal, Snavely et al.
2010; Jeong, Nistér et al. 2012) explore the use of iterative (conjugate gradient) techniques
for the solution of bundle adjustment problems. Other papers explore the use of parallel
multicore algorithms (Wu, Agarwal et al. 2011).

¹⁹ This ordering is preferable when there are fewer cameras than 3D points, which is the usual case. The exception
is when we are tracking a small number of points through many video frames, in which case this ordering should be
reversed.
In more detail, the reduced motion Hessian is computed using the Schur complement,

$$A'_{CC} = A_{CC} - A_{PC}^T A_{PP}^{-1} A_{PC}, \qquad (11.68)$$

where $A_{PP}$ is the point (structure) Hessian (the top left block of Figure 11.16c), $A_{PC}$ is the
point-camera Hessian (the top right block), and $A_{CC}$ and $A'_{CC}$ are the motion Hessians before
and after the point variable elimination (the bottom right block of Figure 11.16c). Notice that
$A'_{CC}$ has a non-zero entry between two cameras if they see any 3D point in common. This is
indicated with dashed arcs in Figure 11.16a and light blue squares in Figure 11.16c.
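The elimination in (11.68) can be illustrated on a small dense example. The sketch below is not how a production solver is organized (real systems use sparse data structures). It assumes three parameters per point and six per camera, a random synthetic Jacobian with the bipartite structure described above, and a small damping term, and the function name `solve_schur` is hypothetical.

```python
import numpy as np

def solve_schur(A_pp_blocks, A_pc, A_cc, b_p, b_c):
    """Solve the block system by eliminating the points (3 parameters each)."""
    n = len(A_pp_blocks)
    inv_blocks = [np.linalg.inv(B) for B in A_pp_blocks]   # cheap 3x3 inverses
    App_inv_Apc = np.vstack([inv_blocks[i] @ A_pc[3 * i:3 * i + 3]
                             for i in range(n)])
    App_inv_bp = np.concatenate([inv_blocks[i] @ b_p[3 * i:3 * i + 3]
                                 for i in range(n)])
    A_reduced = A_cc - A_pc.T @ App_inv_Apc                # Equation (11.68)
    b_reduced = b_c - A_pc.T @ App_inv_bp
    d_cam = np.linalg.solve(A_reduced, b_reduced)          # small camera system
    d_pts = App_inv_bp - App_inv_Apc @ d_cam               # back-substitution
    return d_pts, d_cam

# Synthetic bipartite problem: every point is observed by every camera.
rng = np.random.default_rng(2)
n_pts, n_cams = 8, 3
n_p, n_c = 3 * n_pts, 6 * n_cams
rows = []
for i in range(n_pts):
    for j in range(n_cams):
        J_obs = np.zeros((2, n_p + n_c))                   # one 2D observation
        J_obs[:, 3 * i:3 * i + 3] = rng.normal(size=(2, 3))
        J_obs[:, n_p + 6 * j:n_p + 6 * j + 6] = rng.normal(size=(2, 6))
        rows.append(J_obs)
J = np.vstack(rows)
A = J.T @ J + 1e-3 * np.eye(n_p + n_c)                     # damped Hessian
b = rng.normal(size=n_p + n_c)

A_pp_blocks = [A[3 * i:3 * i + 3, 3 * i:3 * i + 3] for i in range(n_pts)]
d_pts, d_cam = solve_schur(A_pp_blocks, A[:n_p, n_p:], A[n_p:, n_p:],
                           b[:n_p], b[n_p:])
x_direct = np.linalg.solve(A, b)
```

The saving comes from the block-diagonal structure of $A_{PP}$: inverting it costs a handful of $3 \times 3$ inversions, so the only dense solve left is over the camera parameters.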

Whenever there are global parameters present in the reconstruction algorithm, such as 
camera intrinsics that are common to all of the cameras, or camera rig calibration parameters 
such as those shown in Figure 11.15, they should be ordered last (placed along the right and 
bottom edges of A) to reduce fill-in. 

Engels, Stewénius, and Nistér (2006) provide a nice recipe for sparse bundle adjustment, 
including all the steps needed to initialize the iterations, as well as typical computation times 
for a system that uses a fixed number of backward-looking frames in a real-time setting. They 
also recommend using homogeneous coordinates for the structure parameters $p_i$, which is a
good idea, as it avoids numerical instabilities for points near infinity. 

Bundle adjustment is now the standard method of choice for most structure-from-motion 
problems and is commonly applied to problems with hundreds of weakly calibrated images 
and tens of thousands of points. (Much larger problems are commonly solved in photogram- 
metry and aerial imagery, but these are usually carefully calibrated and make use of surveyed 
ground control points.) However, as the problems become larger, it becomes impractical to 
re-solve full bundle adjustment problems at each iteration. 

One approach to dealing with this problem is to use an incremental algorithm, where new 
cameras are added over time. (This makes particular sense if the data is being acquired from 
a video camera or moving vehicle (Nistér, Naroditsky, and Bergen 2006; Pollefeys, Nistér et
al. 2008).) A Kalman filter can be used to incrementally update estimates as new information
is acquired. Unfortunately, such sequential updating is only statistically optimal for linear 
least squares problems. 

For non-linear problems such as structure from motion, an extended Kalman filter, which 
linearizes measurement and update equations around the current estimate, needs to be used 
(Gelb 1974; Viéville and Faugeras 1990). To overcome this limitation, several passes can 
be made through the data (Azarbayejani and Pentland 1995). Because points disappear from 
view (and old cameras become irrelevant), a variable state dimension filter (VSDF) can be 
used to adjust the set of state variables over time, for example, by keeping only cameras and
point tracks seen in the last $k$ frames (McLauchlan 2000). A more flexible approach to using
a fixed number of frames is to propagate corrections backwards through points and cameras
until the changes on parameters are below a threshold (Steedly and Essa 2001). Variants of 
these techniques, including methods that use a fixed window for bundle adjustment (Engels, 
Stewénius, and Nistér 2006) or select keyframes for doing full bundle adjustment (Klein and 
Murray 2008) are now commonly used in simultaneous localization and mapping (SLAM) 
and augmented-reality applications, as discussed in Section 11.5. 

When maximum accuracy is required, it is still preferable to perform a full bundle adjust- 
ment over all the frames. To control the resulting computational complexity, one approach is 
to lock together subsets of frames into locally rigid configurations and to optimize the relative
positions of these clusters (Steedly, Essa, and Dellaert 2003). A different approach is to
select a smaller number of frames to form a skeletal set that still spans the whole dataset and 
produces reconstructions of comparable accuracy (Snavely, Seitz, and Szeliski 2008b). We 
describe this latter technique in more detail in Section 11.4.6, where we discuss applications 
of structure from motion to large image sets. Additional techniques for efficiently solving 
large structure from motion and SLAM systems can be found in the surveys by Dellaert and
Kaess (2017) and Dellaert (2021).

While bundle adjustment and other robust non-linear least squares techniques are the 
methods of choice for most structure-from-motion problems, they suffer from initialization 
problems, i.e., they can get stuck in local energy minima if not started sufficiently close 
to the global optimum. Many systems try to mitigate this by being conservative in what 
reconstruction they perform early on and which cameras and points they add to the solution 
(Section 11.4.6). An alternative, however, is to re-formulate the problem using a norm that 
supports the computation of global optima. 

Kahl and Hartley (2008) describe techniques for using L∞ norms in geometric reconstruction
problems. The advantage of such norms is that globally optimal solutions can be
efficiently computed using second-order cone programming (SOCP). The disadvantage is that
L∞ norms are particularly sensitive to outliers and so must be combined with good outlier
rejection techniques before they can be used.

A large number of high-quality open source bundle adjustment packages have been developed,
including the Ceres Solver,²⁰ Multicore Bundle Adjustment (Wu, Agarwal et al.
2011),²¹ the Sparse Levenberg-Marquardt based non-linear least squares optimizer and bundle
adjuster,²² and OpenSfM.²³ You can find more pointers to open-source software in
Appendix C.2 and reviews of open-source and commercial photogrammetry software²⁴
as well as examples of their application²⁵ on the web.

20 http://ceres-solver.org
21 https://grail.cs.washington.edu/projects/mcba
22 https://github.com/chzach/SSBA
23 https://www.opensfm.org


11.4.4 Application: Match move 


One of the neatest applications of structure from motion is to estimate the 3D motion of 
a video or film camera, along with the geometry of a 3D scene, in order to superimpose 3D 
graphics or computer-generated images (CGI) on the scene. In the visual effects industry, this 
is known as the match move problem (Roble 1999), as the motion of the synthetic 3D camera
used to render the graphics must be matched to that of the real-world camera. For very small 
motions, or motions involving pure camera rotations, one or two tracked points can suffice 
to compute the necessary visual motion. For planar surfaces moving in 3D, four points are 
needed to compute the homography, which can then be used to insert planar overlays, e.g., to 
replace the contents of advertising billboards during sporting events. 
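As a minimal sketch of the planar case, the code below estimates a homography from four (or more) point correspondences using the direct linear transform (DLT). The function name and the sample coordinates are hypothetical, and a production system would normalize coordinates and use robust estimation on tracked features.

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate H (3x3, up to scale) such that dst ~ H @ src using the DLT.

    src, dst: arrays of shape (N, 2) with N >= 4 point correspondences.
    Assumes H[2, 2] is non-zero so that H can be normalized.
    """
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Hypothetical example: corners of a unit-square overlay and their tracked image positions.
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
dst = np.array([[10, 20], [110, 25], [105, 90], [12, 85]], dtype=float)
H = homography_from_points(src, dst)
p = H @ np.array([0.5, 0.5, 1.0])
print(p[:2] / p[2])                          # overlay center mapped into the image
```

Warping the overlay image through H in every frame then keeps it attached to the tracked plane.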

The general version of this problem requires the estimation of the full 3D camera pose 
along with the focal length (zoom) of the lens and potentially its radial distortion parameters 
(Roble 1999). When the 3D structure of the scene is known ahead of time, pose estima- 
tion techniques such as view correlation (Bogart 1991) or through-the-lens camera control 
(Gleicher and Witkin 1992) can be used, as described in Section 11.4.4. 

For more complex scenes, it is usually preferable to recover the 3D structure simultane- 
ously with the camera motion using structure-from-motion techniques. The trick with using 
such techniques is that to prevent any visible jitter between the synthetic graphics and the 
actual scene, features must be tracked to very high accuracy and ample feature tracks must 
be available in the vicinity of the insertion location. Some of today’s best known match 
move software packages, such as the boujou package from 2d3, which won an Emmy award 
in 2002, originated in structure-from-motion research in the computer vision community 
(Fitzgibbon and Zisserman 1998). 


11.4.5 Uncertainty and ambiguities 


Because structure from motion involves the estimation of so many highly coupled parameters, 
often with no known “ground truth” components, the estimates produced by structure from 
motion algorithms can often exhibit large amounts of uncertainty (Szeliski and Kang 1997;
Wilson and Wehrwein 2020). An example of this is the classic bas-relief ambiguity, which
makes it hard to simultaneously estimate the 3D depth of a scene and the amount of camera
motion (Oliensis 2005).²⁶

24 https://peterfalkingham.com/2020/07/10/free-and-commercial-photogrammetry-software-review-2020
25 https://beforesandafters.com/2020/07/06/tales-from-on-set-lidar-scanning-for-joker-and-john-wick-3, https://rd.nytimes.com/projects/reconstructing-journalistic-scenes-in-3d


As mentioned before, a unique coordinate frame and scale for a reconstructed scene can- 
not be recovered from monocular visual measurements alone. (When a stereo rig is used, the 
scale can be recovered if we know the distance (baseline) between the cameras.) This seven- 
degree-of-freedom (coordinate frame and scale) gauge ambiguity makes it tricky to compute 
the covariance matrix associated with a 3D reconstruction (Triggs, McLauchlan et al. 1999; 
Kanatani and Morris 2001). A simple way to compute a covariance matrix that ignores the 
gauge freedom (indeterminacy) is to throw away the seven smallest eigenvalues of the infor- 
mation matrix (inverse covariance), whose values are equivalent to the problem Hessian A up 
to noise scaling (see Section 8.1.4 and Appendix B.6). After we do this, the resulting matrix
can be inverted (as a pseudo-inverse over the remaining eigenvectors) to obtain an estimate of
the parameter covariance.
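The eigenvalue-discarding step can be sketched in a few lines. This is only an illustration under stated assumptions: the function name is hypothetical, the information matrix is taken to be symmetric, and the toy matrix is synthesized with exactly seven zero eigenvalues rather than coming from a real reconstruction.

```python
import numpy as np

def gauge_free_covariance(A, num_gauge=7):
    """Invert a symmetric information matrix A on the subspace that remains
    after discarding its num_gauge smallest eigenvalues (the gauge freedoms)."""
    w, V = np.linalg.eigh(A)                 # eigenvalues in ascending order
    w_keep, V_keep = w[num_gauge:], V[:, num_gauge:]
    cov = (V_keep / w_keep) @ V_keep.T       # V diag(1/w) V^T on the kept subspace
    # The least certain mode is the kept eigenvector with the smallest eigenvalue.
    return cov, V_keep[:, 0], 1.0 / w_keep[0]

# Synthetic 10-parameter information matrix with a 7-dimensional null space.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
w_true = np.concatenate([np.zeros(7), [0.5, 2.0, 4.0]])
A = (Q * w_true) @ Q.T

cov, worst_mode, worst_variance = gauge_free_covariance(A)
print(worst_variance)                        # variance along the least certain direction
```

The returned mode is the direction of greatest joint variation, which is the kind of quantity one would visualize when inspecting the dominant uncertainties of a reconstruction.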


Szeliski and Kang (1997) use this approach to visualize the largest directions of variation 
in typical structure from motion problems. Not surprisingly, they find that, ignoring the gauge 
freedoms, the greatest uncertainties for problems such as observing an object from a small 
number of nearby viewpoints are in the depths of the 3D structure relative to the extent of the
camera motion.²⁷


It is also possible to estimate local or marginal uncertainties for individual parameters, 
which corresponds simply to taking block sub-matrices from the full covariance matrix. Un- 
der certain conditions, such as when the camera poses are relatively certain compared to 3D 
point locations, such uncertainty estimates can be meaningful. However, in many cases, indi- 
vidual uncertainty measures can mask the extent to which reconstruction errors are correlated, 
which is why looking at the first few modes of greatest joint variation can be helpful. 

The other way in which gauge ambiguities affect structure from motion and, in particular, 
bundle adjustment is that they make the system Hessian matrix A rank-deficient and hence 
impossible to invert. A number of techniques have been proposed to mitigate this problem 
(Triggs, McLauchlan et al. 1999; Bartoli 2003). In practice, however, it appears that simply 
adding a small amount of the Hessian diagonal λ diag(A) to the Hessian A itself, as is done in
the Levenberg-Marquardt non-linear least squares algorithm (Appendix A.3), usually works
well.
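The effect of this damping can be seen on a deliberately rank-deficient toy problem. The sketch below assumes residuals of the form r = f(x) − measurement and a Gauss-Newton Hessian A = JᵀJ; the function name and the data are hypothetical.

```python
import numpy as np

def damped_step(J, r, lam=1e-3):
    """One Levenberg-Marquardt step: solve (A + lam * diag(A)) dx = -J^T r."""
    A = J.T @ J                              # Gauss-Newton approximation of the Hessian
    g = J.T @ r                              # gradient of 0.5 * ||r||^2
    return np.linalg.solve(A + lam * np.diag(np.diag(A)), -g)

# Both parameters affect the residuals identically, so A = J^T J is singular
# (a stand-in for a gauge freedom).
J = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
target = np.array([1.0, 2.0, 3.0])
x = np.zeros(2)
r = J @ x - target

dx = damped_step(J, r)
print(np.linalg.matrix_rank(J.T @ J), dx)    # rank 1, yet the damped step is well defined
```

Without the damping term, `np.linalg.solve` would be handed a singular matrix. With it, the update is spread evenly over the two indistinguishable parameters.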


26 Bas-relief refers to a kind of sculpture in which objects, often on ornamental friezes, are sculpted with less
depth than they actually occupy. When lit from above by sunlight, they appear to have true 3D depth because of the 
ambiguity between relative depth and the angle of the illuminant (Section 13.1.1). 

27 A good way to minimize the amount of such ambiguities is to use wide field of view cameras (Antone and Teller 
2002; Levin and Szeliski 2006).

Figure 11.17 Incremental structure from motion (Snavely, Seitz, and Szeliski 2006) © 2006
ACM. Starting with an initial two-frame reconstruction of Trevi Fountain, batches of images
are added using pose estimation, and their positions (along with the 3D model) are refined
using bundle adjustment.


11.4.6 Application: Reconstruction from internet photos 


The most widely used application of structure from motion is in the reconstruction of 3D 
objects and scenes from video sequences and collections of images (Pollefeys and Van Gool 
2002). The last two decades have seen an explosion of techniques for performing this task 
automatically without the need for any manual correspondence or pre-surveyed ground con- 
trol points. A lot of these techniques assume that the scene is taken with the same camera and 
hence the images all have the same intrinsics (Fitzgibbon and Zisserman 1998; Koch, Polle- 
feys, and Van Gool 2000; Schaffalitzky and Zisserman 2002; Tuytelaars and Van Gool 2004; 
Pollefeys, Nistér et al. 2008; Moons, Van Gool, and Vergauwen 2010). Many of these tech- 
niques take the results of the sparse feature matching and structure from motion computation 
and then compute dense 3D surface models using multi-view stereo techniques (Section 12.7) 
(Koch, Pollefeys, and Van Gool 2000; Pollefeys and Van Gool 2002; Pollefeys, Nistér et al. 
2008; Moons, Van Gool, and Vergauwen 2010; Schönberger, Zheng et al. 2016). 

An exciting innovation in this space has been the application of structure from motion and 
multi-view stereo techniques to thousands of images taken from the internet, where very little 
is known about the cameras taking the photographs (Snavely, Seitz, and Szeliski 2008a). Be- 
fore the structure from motion computation can begin, it is first necessary to establish sparse 
correspondences between different pairs of images and to then link such correspondences 
into feature tracks, which associate individual 2D image features with global 3D points. Because
the O(N²) comparison of all pairs of images can be very slow, a number of techniques
have been developed in the recognition community to make this process faster (Section 7.1.4)
(Nistér and Stewénius 2006; Philbin, Chum et al. 2008; Li, Wu et al. 2008; Chum, Philbin,
and Zisserman 2008; Chum and Matas 2010a; Arandjelović and Zisserman 2012).

Figure 11.18 3D reconstructions produced by the incremental structure from motion algorithm
developed by Snavely, Seitz, and Szeliski (2006) © 2006 ACM: (a) cameras and point
cloud from Trafalgar Square; (b) cameras and points overlaid on an image from the Great
Wall of China; (c) overhead view of a reconstruction of the Old Town Square in Prague
registered to an aerial photograph.


To begin the reconstruction process, it is important to select a good pair of images, where 
there are both a large number of consistent matches (to lower the likelihood of incorrect 
correspondences) and a significant amount of out-of-plane parallax,²⁸ to ensure that a stable
reconstruction can be obtained (Snavely, Seitz, and Szeliski 2006). The EXIF tags associated 
with the photographs can be used to get good initial estimates for camera focal lengths, al- 
though this is not always strictly necessary, because these parameters are re-adjusted as part 
of the bundle adjustment process. 

Once an initial pair has been reconstructed, the pose of cameras that see a sufficient num- 
ber of the resulting 3D points can be estimated (Section 11.2) and the complete set of cameras 
and feature correspondences can be used to perform another round of bundle adjustment. Fig- 
ure 11.17 shows the progression of the incremental bundle adjustment algorithm, where sets 
of cameras are added after each successive round of bundle adjustment, while Figure 11.18 
shows some additional results. An alternative to this kind of seed and grow approach is to 
first reconstruct triplets of images and then hierarchically merge them into larger collections 
(Fitzgibbon and Zisserman 1998). 

Unfortunately, as the incremental structure from motion algorithm continues to add more 
cameras and points, it can become extremely slow. The direct solution of a dense system 
of O(N) equations for the camera pose updates can take O(N³) time; while structure from
motion problems are rarely dense, scenes such as city squares have a high percentage of
cameras that see points in common. Re-running the bundle adjustment algorithm after every
few camera additions results in a quartic scaling of the run time with the number of images
in the dataset. One approach to solving this problem is to select a smaller number of images
for the original scene reconstruction and to fold in the remaining images at the very end.

28 A simple way to compute this is to robustly fit a homography to the correspondences and measure reprojection
errors.

Figure 11.19 Large-scale structure from motion using skeletal sets (Snavely, Seitz, and
Szeliski 2008b) © 2008 IEEE: (a) original match graph for 784 images; (b) skeletal set
containing 101 images; (c) top-down view of scene (Pantheon) reconstructed from the skeletal
set; (d) reconstruction after adding in the remaining images using pose estimation; (e) final
bundle adjusted reconstruction, which is almost identical.


Snavely, Seitz, and Szeliski (2008b) develop an algorithm for computing such a skele- 
tal set of images, which is guaranteed to produce a reconstruction whose error is within a 
bounded factor of the optimal reconstruction accuracy. Their algorithm first evaluates all 
pairwise uncertainties (position covariances) between overlapping images and then chains 
them together to estimate a lower bound for the relative uncertainty of any distant pair. The 
skeletal set is constructed so that the maximal uncertainty between any pair grows by no 
more than a constant factor. Figure 11.19 shows an example of the skeletal set computed for 
784 images of the Pantheon in Rome. As you can see, even though the skeletal set contains 
just a fraction of the original images, the shapes of the skeletal set and full bundle adjusted
reconstructions are virtually indistinguishable.


Since the initial publication on large-scale internet photo reconstruction by Snavely, Seitz, 
and Szeliski (2008a,b), there have been a large number of follow-on papers exploring even 
larger datasets and more efficient algorithms (Agarwal, Furukawa et al. 2010, 2011; Frahm, 
Fite-Georgel et al. 2010; Wu 2013; Heinly, Schönberger et al. 2015; Schönberger and Frahm 
2016). Among these, the COLMAP open source structure from motion and multi-view stereo
system is currently one of the most widely used, as it can reconstruct extremely large scenes,
such as the one shown in Figure 11.20 (Schönberger and Frahm 2016).²⁹

Figure 11.20 Large-scale reconstructions created with the COLMAP structure from motion
and multi-view stereo system: (a) sparse model of central Rome constructed from 21K
photos (Schönberger and Frahm 2016) © 2016 IEEE; (b) dense models of several landmarks
produced with the MVS pipeline (Schönberger, Zheng et al. 2016) © 2016 Springer.

The ability to automatically reconstruct 3D models from large, unstructured image col- 
lections has also brought to light subtle problems with traditional structure from motion al- 
gorithms, including the need to deal with repetitive and duplicate structures (Wu, Frahm, and 
Pollefeys 2010; Roberts, Sinha et al. 2011; Wilson and Snavely 2013; Heinly, Dunn, and 
Frahm 2014) as well as dynamic visual objects such as people (Ji, Dunn, and Frahm 2014; 
Zheng, Wang et al. 2014). It has also opened up a wide variety of additional applications, 
including the ability to automatically find and label locations and regions of interest (Simon, 
Snavely, and Seitz 2007; Simon and Seitz 2008; Gammeter, Bossard et al. 2009) and to cluster 
large image collections so that they can be automatically labeled (Li, Wu et al. 2008; Quack, 
Leibe, and Van Gool 2008). Some additional applications related to image-based rendering 
are discussed in more detail in Section 14.1.2. 


11.4.7 Global structure from motion 


While incremental bundle adjustment algorithms are still the most commonly used approaches 
for large-scale reconstruction (Schönberger and Frahm 2016), they can be quite slow because
of the need to successively solve increasingly larger optimization problems. An alternative to
iteratively growing the solution is to solve for all of the structure and motion unknowns in a
single global step, once the feature correspondences have been established.

29 https://colmap.github.io

Figure 11.21 Global structure from motion pipeline from Sinha, Steedly, and Szeliski
(2010) © 2010 Springer. Vanishing point and feature-based pairwise rotation estimates are
used to first estimate a globally consistent set of orientations (rotations). The scales of all
pairwise reconstructions along with the camera center positions are then estimated in a single
linear least squares minimization.

One approach to this is to set up a linear system of equations that relate all of the camera 
centers and 3D point, line, and plane equations to the known 2D feature or line positions 
(Kaucic, Hartley, and Dano 2001; Rother 2003). However, these approaches require a refer- 
ence plane (e.g., building wall) to be visible and matched in all images, and are also sensitive 
to distant points, which must first be discarded. These approaches, while theoretically inter- 
esting, are not widely used. 

A second approach, first proposed by Govindu (2001), starts by computing pairwise Euclidean
structure and motion reconstructions using the techniques discussed in Section 11.3.3.³⁰
Pairwise rotation estimates are then used to compute a globally consistent orientation estimate
for each camera, using a process known as rotation averaging (Govindu 2001; Martinec and
Pajdla 2007; Chatterjee and Govindu 2013; Hartley, Trumpf et al. 2013; Dellaert, Rosen et al.
2020).³¹ In a final step, the camera positions are determined by scaling each of the local camera
translations, after they have been rotated into a global coordinate system (Govindu 2001,
2004; Martinec and Pajdla 2007; Sinha, Steedly, and Szeliski 2010). In the robotics (SLAM) 
community, this last step is called pose graph optimization (Carlone, Tron et al. 2015). 
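One simple way to illustrate rotation averaging is a linear (chordal) formulation: write each pairwise constraint as R_j = R_ij R_i, solve for all rotation matrices in a single least squares problem with the first camera fixed to the identity, and project each result back onto a valid rotation. This is a sketch of one of many possible formulations, not the specific method of any paper cited above; the function names and the convention R_j = R_ij R_i are assumptions.

```python
import numpy as np

def axis_angle(axis, angle):
    """Rotation matrix from an axis and an angle (Rodrigues' formula)."""
    a = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    K = np.array([[0, -a[2], a[1]], [a[2], 0, -a[0]], [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def project_to_so3(M):
    """Closest rotation matrix to M in the Frobenius norm."""
    U, _, Vt = np.linalg.svd(M)
    return U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt

def average_rotations(n, rel):
    """rel: {(i, j): R_ij} with R_j ~ R_ij @ R_i. Camera 0 is fixed to identity."""
    blocks = []
    for (i, j), R_ij in rel.items():
        B = np.zeros((3, 3 * n))
        B[:, 3 * j:3 * j + 3] = np.eye(3)    # + R_j
        B[:, 3 * i:3 * i + 3] -= R_ij        # - R_ij R_i
        blocks.append(B)
    A = np.vstack(blocks)
    rhs = -A[:, :3]                          # move the known R_0 = I to the right-hand side
    X, *_ = np.linalg.lstsq(A[:, 3:], rhs, rcond=None)
    return [np.eye(3)] + [project_to_so3(X[3 * k:3 * k + 3]) for k in range(n - 1)]

# Noise-free toy example with four cameras and five pairwise estimates.
R_true = [np.eye(3), axis_angle([0, 0, 1], 0.3),
          axis_angle([1, 0, 0], -0.5), axis_angle([1, 1, 0], 0.8)]
pairs = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
rel = {(i, j): R_true[j] @ R_true[i].T for i, j in pairs}
R_est = average_rotations(4, rel)
```

With noisy pairwise estimates, the linear solve no longer returns exact rotations, which is why the projection step is needed; more robust schemes are described in the papers cited above.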

Figure 11.21 shows a more recent pipeline implementing this concept, which includes the 
initial feature point extraction, matching, and two-view reconstruction, followed by global 
rotation estimation, and then a final solve for the camera centers. The pipeline developed by 
Sinha, Steedly, and Szeliski (2010) also matches vanishing points, when these can be found, 
in order to eliminate rotational drift in the global orientation estimates. 


30 While almost all of these techniques assume known calibration (focal lengths) for each image, Sweeney,
Kneip et al. (2015) estimate focal lengths from refined fundamental matrices.
31 We have already introduced the concept of rotation averaging when we discussed global registration of panoramas
in Section 8.3.1.

While there are several alternative algorithms for estimating the global rotations, an even
wider variety of algorithms exists for estimating the camera centers. After rotating all of the
cameras by their global rotation estimate, we can compute the globally oriented local translation
direction in each reconstructed pair $ij$ and denote this as $\hat{\mathbf{t}}_{ij}$. The fundamental relationship
between the unknown camera centers $\{\mathbf{c}_i\}$ and the translation directions can be written as

$$\mathbf{c}_j - \mathbf{c}_i = s_{ij} \hat{\mathbf{t}}_{ij} \qquad (11.69)$$

or

$$\hat{\mathbf{t}}_{ij} \times (\mathbf{c}_j - \mathbf{c}_i) = \mathbf{0} \qquad (11.70)$$

(Govindu 2001). The first set of equations can be solved to obtain the camera centers $\{\mathbf{c}_i\}$
and the scale variables $s_{ij}$, while the second directly produces only the camera positions. In
addition to being homogeneous (only known up to a scale), the camera centers also have a 
translational gauge freedom, i.e., they can all be translated (but this is always the case with 
structure from motion). 
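As an illustration of the scale-free variant, eliminating s_ij from c_j − c_i = s_ij t̂_ij gives the constraint t̂_ij × (c_j − c_i) = 0, which is linear and homogeneous in the camera centers. The sketch below stacks these constraints, fixes the translational gauge by setting c_0 = 0, and takes the null vector of the resulting system. The function names and the toy configuration are hypothetical, and no baseline weighting or outlier handling is attempted.

```python
import numpy as np

def skew(t):
    """3x3 matrix such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def solve_camera_centers(n, directions):
    """directions: {(i, j): unit vector pointing from camera i to camera j}.

    Returns an (n, 3) array of centers with c_0 = 0 and unit overall norm
    (both are arbitrary gauge choices)."""
    blocks = []
    for (i, j), t in directions.items():
        B = np.zeros((3, 3 * n))
        B[:, 3 * j:3 * j + 3] = skew(t)      # t x c_j
        B[:, 3 * i:3 * i + 3] = -skew(t)     # - t x c_i
        blocks.append(B)
    A = np.vstack(blocks)[:, 3:]             # drop the c_0 columns (c_0 = 0)
    _, _, Vt = np.linalg.svd(A)
    c = np.concatenate([np.zeros(3), Vt[-1]]).reshape(n, 3)
    # Fix the sign so that baselines point along the measured directions.
    (i, j), t = next(iter(directions.items()))
    return c if (c[j] - c[i]) @ t > 0 else -c

# Noise-free toy configuration of four non-collinear cameras.
c_true = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [1, 1, 3]], dtype=float)
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
directions = {(i, j): (c_true[j] - c_true[i]) / np.linalg.norm(c_true[j] - c_true[i])
              for i, j in pairs}
c_est = solve_camera_centers(4, directions)
```

If the cameras were collinear, the null space of this system would have more than one dimension, so the smallest singular vector alone would no longer determine the centers.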

Because these equations minimize the algebraic alignment between local translation di- 
rections and global camera center differences, they do not correctly weight reconstructions 
with different baselines. Several alternatives have been proposed to remediate this (Govindu 
2004; Sinha, Steedly, and Szeliski 2010; Jiang, Cui, and Tan 2013; Moulon, Monasse, and 
Marlet 2013; Wilson and Snavely 2014; Cui and Tan 2015; Özyeşil and Singer 2015; Holynski,
Geraghty et al. 2020). Some of these techniques also cannot handle collinear cameras: in
the original formulation, as well as in some more recent ones, we can shift cameras along a
collinear segment and still satisfy the directional constraints.

For community photo collections taken over a large area such as a plaza, this is not a cru- 
cial problem (Wilson and Snavely 2014). However, for reconstructions from video or walks 
around or through a building, the collinear camera problem is a real issue. Sinha, Steedly, 
and Szeliski (2010) handle this by estimating the relative scales of pairwise reconstructions
that share a common camera and then using these relative scales to constrain all of the global
scales.

Two open-source structure from motion pipelines that include some of these global techniques
are Theia³² (Sweeney, Hollerer, and Turk 2015) and OpenMVG³³ (Moulon, Monasse
et al. 2016). These papers have nice reviews of the related algorithms.


32 http://www.theia-sfm.org
33 https://github.com/openMVG/openMVG

Figure 11.22 Two images of a toy house along with their matched 3D line segments
(Schmid and Zisserman 1997) © 1997 Springer.


11.4.8 Constrained structure and motion


The most general algorithms for structure from motion make no prior assumptions about the 
objects or scenes that they are reconstructing. In many cases, however, the scene contains 
higher-level geometric primitives, such as lines and planes. These can provide information 
complementary to interest points and also serve as useful building blocks for 3D modeling 
and visualization. Furthermore, these primitives are often arranged in particular relationships, 
i.e., many lines and planes are either parallel or orthogonal to each other (Zhou, Furukawa,
and Ma 2019; Zhou, Furukawa et al. 2020). This is particularly true of architectural scenes 
and models, which we study in more detail in Section 13.6.1. 

Sometimes, instead of exploiting regularity in the scene structure, it is possible to take 
advantage of a constrained motion model. For example, if the object of interest is rotating on 
a turntable (Szeliski 1991b), i.e., around a fixed but unknown axis, specialized techniques can 
be used to recover this motion (Fitzgibbon, Cross, and Zisserman 1998). In other situations, 
the camera itself may be moving in a fixed arc around some center of rotation (Shum and 
He 1999). Specialized capture setups, such as mobile stereo camera rigs or moving vehicles 
equipped with multiple fixed cameras, can also take advantage of the knowledge that individual
cameras are (mostly) fixed with respect to the capture rig, as shown in Figure 11.15.³⁴


Line-based techniques 


It is well known that pairwise epipolar geometry cannot be recovered from line matches 
alone, even if the cameras are calibrated. To see this, think of projecting the set of lines in 
each image into a set of 3D planes in space. You can move the two cameras around into any 
configuration you like and still obtain a valid reconstruction for 3D lines. 


34 Because of mechanical compliance and jitter, it may be prudent to allow for a small amount of individual camera
rotation around a nominal position.

When lines are visible in three or more views, the trifocal tensor can be used to transfer
lines from one pair of images to another (Hartley and Zisserman 2004). The trifocal tensor 
can also be computed on the basis of line matches alone. 

Schmid and Zisserman (1997) describe a widely used technique for matching 2D lines 
based on the average of 15 x 15 pixel correlation scores evaluated at all pixels along their 
common line segment intersection.³⁵ In their system, the epipolar geometry is assumed to be
known, e.g., computed from point matches. For wide baselines, all possible homographies 
corresponding to planes passing through the 3D line are used to warp pixels and the maximum 
correlation score is used. For triplets of images, the trifocal tensor is used to verify that 
the lines are in geometric correspondence before evaluating the correlations between line 
segments. Figure 11.22 shows the results of using their system. 

Bartoli and Sturm (2003) describe a complete system for extending three view relations 
(trifocal tensors) computed from manual line correspondences to a full bundle adjustment of 
all the line and camera parameters. The key to their approach is to use the Plücker coordinates
(2.12) to parameterize lines and to directly minimize reprojection errors. It is also
possible to represent 3D line segments by their endpoints and to measure either the reprojec- 
tion error perpendicular to the detected 2D line segments in each image or the 2D errors using 
an elongated uncertainty ellipse aligned with the line segment direction (Szeliski and Kang 
1994). 

Instead of reconstructing 3D lines, Bay, Ferrari, and Van Gool (2005) use RANSAC to 
group lines into likely coplanar subsets. Four lines are chosen at random to compute a homog- 
raphy, which is then verified for these and other plausible line segment matches by evaluating 
color histogram-based correlation scores. The 2D intersection points of lines belonging to the 
same plane are then used as virtual measurements to estimate the epipolar geometry, which 
is more accurate than using the homographies directly.

An alternative to grouping lines into coplanar subsets is to group lines by parallelism. 
Whenever three or more 2D lines share a common vanishing point, there is a good likelihood 
that they are parallel in 3D. By finding multiple vanishing points in an image (Section 7.4.3) 
and establishing correspondences between such vanishing points in different images, the rel- 
ative rotations between the various images (and often the camera intrinsics) can be directly 
estimated (Section 11.1.1). Finding an orthogonal set of vanishing points and using these 
to establish a global orientation is often called invoking the Manhattan world assumption 
(Coughlan and Yuille 1999). A generalized version where streets can meet at non-orthogonal 
angles was called the Atlanta world by Schindler and Dellaert (2004). 
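The basic geometric step in grouping by parallelism, finding the common intersection point of several 2D lines, can be sketched with homogeneous coordinates. The function names and the sample segments below are hypothetical, and the sketch omits the robust grouping (e.g., RANSAC or voting) that a practical vanishing point detector would need.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points p and q, each given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(segments):
    """Least squares intersection of the lines supporting the given segments.

    segments: iterable of ((x1, y1), (x2, y2)). Returns a homogeneous 3-vector,
    which also covers vanishing points at infinity (last coordinate near zero).
    """
    L = np.array([line_through(p, q) for p, q in segments])
    L /= np.linalg.norm(L[:, :2], axis=1, keepdims=True)   # unit line normals
    # The point minimizing the sum of squared line equations is the right
    # singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(L)
    return Vt[-1]

# Three segments whose supporting lines all pass through (200, 100).
segments = [((0, 0), (100, 50)), ((0, 100), (50, 100)), ((0, 200), (100, 150))]
v = vanishing_point(segments)
print(v[:2] / v[2])
```

Back-projecting each vanishing point through the (assumed known) intrinsics gives a 3D direction, and matching such directions across images is what constrains the relative rotations.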


35 Because lines often occur at depth or orientation discontinuities, it may be preferable to compute correlation
scores (or to match color histograms (Bay, Ferrari, and Van Gool 2005)) separately on each side of the line.

Shum, Han, and Szeliski (1998) describe a 3D modeling system that constructs calibrated
panoramas from multiple images (Section 11.4.2) and then has the user draw vertical and 
horizontal lines in the image to demarcate the boundaries of planar regions. The lines are 
used to establish an absolute rotation for each panorama and are then used (along with the 
inferred vertices and planes) to build a 3D structure, which can be recovered up to scale from 
one or more images (Figure 13.20). 

A fully automated approach to line-based structure from motion is presented by Werner 
and Zisserman (2002). In their system, they first find lines and group them by common 
vanishing points in each image (Section 7.4.3). The vanishing points are then used to calibrate 
the camera, i.e., to perform a “metric upgrade” (Section 11.1.1). Lines corresponding to 
common vanishing points are then matched using both appearance (Schmid and Zisserman 
1997) and trifocal tensors. These lines are then used to infer planes and a block-structured 
model for the scene, as described in more detail in Section 13.6.1. More recent approaches
based on deep neural networks can also construct 3D wireframe models from one or more
images.


Plane-based techniques 


In scenes that are rich in planar structures, e.g., in architecture, it is possible to directly es- 
timate homographies between different planes, using either feature-based or intensity-based 
methods. In principle, this information can be used to simultaneously infer the camera poses 
and the plane equations, i.e., to compute plane-based structure from motion. 

Luong and Faugeras (1996) show how a fundamental matrix can be directly computed 
from two or more homographies using algebraic manipulations and least squares. Unfortu- 
nately, this approach often performs poorly, because the algebraic errors do not correspond to 
meaningful reprojection errors (Szeliski and Torr 1998). 

A better approach is to hallucinate virtual point correspondences within the areas from 
which each homography was computed and to feed them into a standard structure from mo- 
tion algorithm (Szeliski and Torr 1998). An even better approach is to use full bundle adjust- 
ment with explicit plane equations, as well as additional constraints to force reconstructed 
co-planar features to lie exactly on their corresponding planes. (A principled way to do this 
is to establish a coordinate frame for each plane, e.g., at one of the feature points, and to use 
2D in-plane parameterizations for the other points.) The system developed by Shum, Han, 
and Szeliski (1998) shows an example of such an approach, where the directions of lines and 
normals for planes in the scene are prespecified by the user. In more recent work, Micusik
and Wildenauer (2017) use planes as additional constraints inside a bundle adjustment formulation.
Other recent papers that use combinations of lines and/or planes to reduce drift in
3D reconstructions include Zhou, Zou et al. (2015), Li, Yao et al. (2018), Yang and Scherer
(2019), and Holynski, Geraghty et al. (2020).

Figure 11.23 In simultaneous localization and mapping (SLAM), the system simultaneously
estimates the positions of a robot and its nearby landmarks (Durrant-Whyte and Bailey
2006) © 2006 IEEE.


11.5 Simultaneous localization and mapping (SLAM) 


While the computer vision community has been studying structure from motion, i.e., the re- 
construction of sparse 3D models from multiple images and videos, since the early 1980s 
(Longuet-Higgins 1981), the mobile robotics community has in parallel been studying the 
automatic construction of 3D maps from moving robots.³⁶ In robotics, the problem was formulated
as the simultaneous estimation of 3D robot and landmark poses (Figure 11.23), and
was known as probabilistic mapping (Thrun, Burgard, and Fox 2005) and simultaneous local- 
ization and mapping (SLAM) (Durrant-Whyte and Bailey 2006; Bailey and Durrant-Whyte 
2006; Cadena, Carlone et al. 2016). In the computer vision community, the problem was 
originally called visual odometry (Levin and Szeliski 2004; Nistér, Naroditsky, and Bergen 
2006; Maimone, Cheng, and Matthies 2007), although that term is now usually reserved for 
shorter-range motion estimation that does not involve building a global map with loop closing 
(Cadena, Carlone et al. 2016). 


36 In the 1980s, the vision and robotics communities were essentially the same set of researchers working in these
two sub-fields of artificial intelligence.

Figure 11.24 The architecture of the LSD-SLAM system (Engel, Schöps, and Cremers
2014) © 2014 Springer, showing the front end, which does the tracking, data association,
and local 3D pose and structure (depth map) updating, and the back end, which does global
map optimization.

Early versions of such algorithms used range-sensing techniques, such as ultrasound, laser
range finders, or stereo matching, to estimate local 3D geometry, which could then be fused
into a 3D model. Newer techniques can perform the same task based purely on visual feature 
tracking from a monocular camera (Davison, Reid et al. 2007). Good introductory tutorials 
can be found in Durrant-Whyte and Bailey (2006) and Bailey and Durrant-Whyte (2006), 
while more comprehensive surveys of more recent techniques are presented in (Fuentes- 
Pacheco, Ruiz-Ascencio, and Rendón-Mancha 2015) and Cadena, Carlone et al. (2016). 

SLAM differs from bundle adjustment in two fundamental aspects. First, it allows for a 
variety of sensing devices, instead of just being restricted to tracked or matched feature points. 
Second, it solves the localization problem online, i.e., with no or very little lag in providing 
the current sensor pose. This makes it the method of choice for both time-critical robotics 
applications such as autonomous navigation (Section 11.5.1) and real-time augmented reality 
(Section 11.5.2). 


Some of the important milestones in SLAM include: 


• the application of SLAM to monocular cameras (MonoSLAM) (Davison, Reid et al.
2007);


• parallel tracking and mapping (PTAM) (Klein and Murray 2007), which split the front
end (tracking) and back end (mapping) processes (Figure 11.24) onto two separate
threads running at different rates (Figure 11.27) and was later implemented on a
camera phone (Klein and Murray 2009);


• adaptive relative bundle adjustment (Sibley, Mei et al. 2009, 2010), which maintains
collections of local reconstructions anchored at different keyframes;


• incremental smoothing and mapping (SAM) (Kaess, Ranganathan, and Dellaert 2008;
Kaess, Johannsson et al. 2012) and other applications of factor graphs to handle the
speed-accuracy-delay tradeoff (Dellaert and Kaess 2017; Dellaert 2021);


• dense tracking and mapping (DTAM) (Newcombe, Lovegrove, and Davison 2011),
which estimates and updates a dense depth map for every frame;


• ORB-SLAM (Mur-Artal, Montiel, and Tardós 2015) and ORB-SLAM2 (Mur-Artal
and Tardós 2017), which handle monocular, stereo, and RGB-D cameras as well as
loop closures;


• SVO (semi-direct visual odometry) (Forster, Zhang et al. 2017), which combines
patch-based tracking with classic bundle adjustment;


• LSD-SLAM (large-scale direct SLAM) (Engel, Schöps, and Cremers 2014) and DSO
(direct sparse odometry) (Engel, Koltun, and Cremers 2018), which only keep depth
estimates at strong gradient locations (Figure 11.24); and


• BAD SLAM (bundle adjusted direct RGB-D SLAM) (Schöps, Sattler, and Pollefeys
2019a).


Many of these systems have open source implementations. Some widely used benchmarks 
include a benchmark for RGB-D SLAM systems (Sturm, Engelhard et al. 2012), the KITTI 
Visual Odometry / SLAM benchmark (Geiger, Lenz et al. 2013), the synthetic ICL-NUIM 
dataset (Handa, Whelan et al. 2014), the TUM monoVO dataset (Engel, Usenko, and Cremers 
2016), the EuRoC MAV dataset (Burri, Nikolic et al. 2016), the ETH3D SLAM benchmark 
(Schöps, Sattler, and Pollefeys 2019a), and the GSLAM general SLAM benchmark (Zhao,
Xu et al. 2019). 

The most recent trend in SLAM has been the integration with visual-inertial odometry 
(VIO) algorithms (Mourikis and Roumeliotis 2007; Li and Mourikis 2013; Forster, Carlone 
et al. 2016), which combine higher-frequency inertial measurement unit (IMU) measure- 
ments with visual tracks, which serve to remove low-frequency drift. Because IMUs are now 
commonplace in consumer devices such as cell phones and action cameras, VIO-enhanced 
SLAM systems serve as the foundation for widely used mobile augmented reality frameworks 
such as ARKit and ARCore (Section 11.5.2). A dataset and evaluation of open-source VIO 
systems can be found at Schubert, Goll et al. (2018). 
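
To make this division of labor concrete, the following toy sketch shows how high-rate inertial measurements and lower-rate visual measurements can complement each other. It is a one-dimensional complementary filter on a single heading angle, and it is not the method used by any of the systems cited above, which typically rely on extended Kalman filters or factor-graph optimization. The function name `fuse_heading`, the `gain` parameter, and the simulated gyro bias are all illustrative assumptions.

```python
def fuse_heading(gyro_rates, visual_headings, dt, gain=0.2):
    """Toy 1D complementary filter (illustrative only).

    gyro_rates:      angular rate samples (rad/s), one per time step
    visual_headings: heading measurements (rad), or None when no visual
                     measurement is available at that step
    """
    heading = 0.0
    estimates = []
    for rate, visual in zip(gyro_rates, visual_headings):
        # High-rate prediction: integrate the gyro (drifts if the gyro is biased).
        heading += rate * dt
        # Low-rate correction: pull the estimate toward the visual measurement.
        if visual is not None:
            heading += gain * (visual - heading)
        estimates.append(heading)
    return estimates


# Simulated data (all values are made up for illustration).
dt = 0.01                      # 100 Hz gyro
true_rate, bias = 0.5, 0.05    # rad/s; the bias causes slow drift
n = 2000
truth = [true_rate * dt * (k + 1) for k in range(n)]
gyro = [true_rate + bias] * n
# A drift-free visual heading arrives every 10th sample (10 Hz).
visual = [truth[k] if k % 10 == 9 else None for k in range(n)]

fused = fuse_heading(gyro, visual, dt)
imu_only = fuse_heading(gyro, [None] * n, dt)
print("gyro-only error:", abs(imu_only[-1] - truth[-1]))
print("fused error:    ", abs(fused[-1] - truth[-1]))
```

In this simulation the gyro-only estimate drifts steadily because of the bias, while the fused estimate stays close to the true heading.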


Figure 11.25 Autonomous vehicles: (a) the Stanford Cart (Moravec 1983) © 1983 IEEE;
(b) Junior: The Stanford entry in the Urban Challenge (Montemerlo, Becker et al. 2008) ©
2008 Wiley; (c–d) self-driving car prototypes from the CVPR 2019 exhibit floor.


As you can tell from this very brief overview, SLAM is an incredibly rich and rapidly 
evolving field of research, full of challenging robust optimization and real-time performance 
problems. Good sources for finding the most recent papers and algorithms are the
KITTI Visual Odometry/SLAM Evaluation37 (Geiger, Lenz, and Urtasun 2012) and the recent
survey paper on computer vision for autonomous driving (Janai, Güney et al. 2020,
Section 13.2).


11.5.1 Application: Autonomous navigation 


Since the early days of artificial intelligence and robotics, computer vision has been used to 
enable manipulation for dextrous robots and navigation for autonomous robots (Janai, Güney
et al. 2020; Kubota 2019). Some of the earliest vision-based navigation systems include
the Stanford Cart (Figure 11.25a) and CMU Rover (Moravec 1980, 1983), the Terregator
(Wallace, Stentz et al. 1985), and the CMU Navlab (Thorpe, Hebert et al. 1988), which
originally could only advance 4m every 10 sec (< 1 mph), and which was also the first
system to use a neural network for driving (Pomerleau 1989).


37http://www.cvlibs.net/datasets/kitti/eval_odometry.php


Figure 11.26 Fully autonomous Skydio R1 drone flying in the wild © 2019 Skydio: (a)
multiple input images and depth maps; (b) fully integrated 3D map (Cross 2019).

The early algorithms and technologies advanced rapidly, with the VaMoRs system of 
Dickmanns and Mysliwetz (1992) operating a 25Hz Kalman filter loop and driving with good 
lane markings at 100 km/h. By the mid 2000s, when DARPA introduced their Grand Chal- 
lenge and Urban Challenge, vehicles equipped with both range-finding lidar cameras and 
stereo cameras were able to traverse rough outdoor terrain and navigate city streets at regular 
human driving speeds (Urmson, Anhalt et al. 2008; Montemerlo, Becker et al. 2008).38 These 
systems led to the formation of industrial research projects at companies such as Google and 
Tesla,39 as well as numerous startups, many of which exhibit their vehicles at computer vision
conferences (Figure 11.25c–d).

A comprehensive review of computer vision technologies for autonomous vehicles can 
be found in the survey by Janai, Güney et al. (2020), which also comes with a useful on-line
visualization tool of relevant papers.40 The survey contains chapters on the large number
of vision algorithms and components that go into autonomous navigation, which include 
datasets and benchmarks, sensors, object detection and tracking, segmentation, stereo, flow 
and scene flow, SLAM, scene understanding, and end-to-end learning of autonomous driving 
behaviors. 


In addition to autonomous navigation for wheeled (and legged) robots and vehicles,
computer vision algorithms are widely used in the control of autonomous drones for both
recreational applications (Ackerman 2019) (Figure 11.26) and drone racing (Jung, Hwang et al.
2018; Kaufmann, Gehrig et al. 2019). A great talk describing Skydio’s approach to visual
autonomous navigation by Gareth Cross (2019) can be found in the ICRA 2019 Workshop on
Algorithms and Architectures for Learning In-The-Loop Systems in Autonomous Flight41 as
well as Lecture 23 in Pieter Abbeel’s (2019) class on Advanced Robotics, which has dozens
of other interesting related lectures.


38 Algorithms that use range data as part of their map building and localization are commonly called RGB-D SLAM
systems (Sturm, Engelhard et al. 2012).

39 You can find a number of talks about Tesla’s efforts on Andrej Karpathy’s web page, https://karpathy.ai.

40http://www.cvlibs.net/projects/autonomous-vision-survey


Figure 11.27 3D augmented reality: (a) Darth Vader and a horde of Ewoks battle it out
on a table-top recovered using real-time, keyframe-based structure from motion (Klein and
Murray 2007) © 2007 IEEE; (b) a virtual teapot is fixed to the top of a real-world coffee cup,
whose pose is re-recognized at each time frame (Gordon and Lowe 2006) © 2007 Springer.


11.5.2 Application: Smartphone augmented reality 


Another closely related application is augmented reality, where 3D objects are inserted into 
a video feed in real time, often to annotate or help users understand a scene (Azuma, Bail- 
lot et al. 2001; Feiner 2002; Billinghurst, Clark, and Lee 2015). While traditional systems 
require prior knowledge about the scene or object being visually tracked (Rosten and Drum- 
mond 2005), newer systems can simultaneously build up a model of the 3D environment and 
then track it so that graphics can be superimposed (Reitmayr and Drummond 2006; Wagner, 
Reitmayr et al. 2008). 

Klein and Murray (2007) describe a parallel tracking and mapping (PTAM) system, 
which simultaneously applies full bundle adjustment to keyframes selected from a video 
stream, while performing robust real-time pose estimation on intermediate frames (Figure 11.27a).
Once an initial 3D scene has been reconstructed, a dominant plane is estimated (in this case,
the table-top) and 3D animated characters are virtually inserted. Klein and Murray (2008)
extend this system to handle even faster camera motion by adding edge features, which can still
be tracked even when interest points become too blurred. They also use a direct (intensity-based)
rotation estimation algorithm for even faster motions.


41https://uav-learning-icra.github.io/2019

Instead of modeling the whole scene as one rigid reference frame, Gordon and Lowe 
(2006) first build a 3D model of an individual object using feature matching and structure 
from motion. Once the system has been initialized, for every new frame they find the object 
and its pose using a 3D instance recognition algorithm, and then superimpose a graphical 
object onto that model, as shown in Figure 11.27b. 

While reliably tracking such objects and environments is now a well-solved problem, with 
frameworks such as ARKit,42 ARCore,43 and Spark AR44 being widely used for mobile AR
application development, determining which pixels should be occluded by foreground scene 
elements (Chuang, Agarwala et al. 2002; Wang and Cohen 2009) still remains an active 
research area. 

One recent example of such work is the Smartphone AR system developed by Valentin, 
Kowdle et al. (2018) shown in Figure 11.28. The system proceeds by generating a semi-dense 
depth map by matching the current frame to a previous keyframe using a CRF followed by 
a filtering step. This map is then interpolated to full resolution using a novel planar bilateral 
solver, and the resulting depth map is used for occlusion effects. As accurate per-pixel depth is
such an essential component of augmented reality effects, we are likely to see rapid progress
in this area, using both active and passive depth sensing technologies.
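
Once a per-pixel depth map is available, the occlusion effect itself reduces to a depth test at each pixel. The sketch below illustrates that final compositing step only. It is not the pipeline of Valentin, Kowdle et al. (2018), and the function name `composite_with_occlusion`, the nested-list image representation, and the use of `None` for pixels not covered by the virtual object are assumptions made for illustration.

```python
def composite_with_occlusion(camera_rgb, scene_depth, virtual_rgb, virtual_depth):
    """Composite a rendered virtual object into a camera image.

    All inputs are same-sized 2D lists. A virtual pixel is shown only where
    the virtual object exists (depth is not None) and is closer to the
    camera than the real scene at that pixel.
    """
    output = []
    for cam_row, sd_row, vr_row, vd_row in zip(
        camera_rgb, scene_depth, virtual_rgb, virtual_depth
    ):
        row = []
        for cam, sd, vr, vd in zip(cam_row, sd_row, vr_row, vd_row):
            virtual_visible = vd is not None and vd < sd
            row.append(vr if virtual_visible else cam)
        output.append(row)
    return output


# A 1x3 toy image; strings stand in for RGB values.
camera = [["c0", "c1", "c2"]]
scene_depth = [[1.0, 2.0, 3.0]]        # meters, from the estimated depth map
virtual = [["v0", "v1", "v2"]]
virtual_depth = [[2.0, 1.5, None]]     # None: object does not cover this pixel

composite = composite_with_occlusion(camera, scene_depth, virtual, virtual_depth)
print(composite)
```

In the toy example, the first virtual pixel is hidden behind the real scene, the second is drawn in front of it, and the third is left as the camera pixel because the object does not cover it.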


11.6 Additional reading 


Camera calibration was first studied in photogrammetry (Brown 1971; Slama 1980; Atkinson 
1996; Kraus 1997) but it has also been widely studied in computer vision (Tsai 1987; Grem- 
ban, Thorpe, and Kanade 1988; Champleboux, Lavallée et al. 1992b; Zhang 2000; Grossberg 
and Nayar 2001). Vanishing points observed either from rectahedral calibration objects or ar- 
chitecture are often used to perform rudimentary calibration (Caprile and Torre 1990; Becker 
and Bove 1995; Liebowitz and Zisserman 1998; Cipolla, Drummond, and Robertson 1999; 
Antone and Teller 2002; Criminisi, Reid, and Zisserman 2000; Hartley and Zisserman 2004;
Pflugfelder 2008). Performing camera calibration without using known targets is known as
self-calibration and is discussed in textbooks and surveys on structure from motion (Faugeras,
Luong, and Maybank 1992; Hartley and Zisserman 2004; Moons, Van Gool, and Vergauwen
2010). One popular subset of such techniques uses pure rotational motion (Stein 1995;
Hartley 1997b; Hartley, Hayman et al. 2000; de Agapito, Hayman, and Reid 2001; Kang and
Weiss 1999; Shum and Szeliski 2000; Frahm and Koch 2003).


42https://developer.apple.com/augmented-reality
43https://developers.google.com/ar
44https://sparkar.facebook.com/ar-studio


Figure 11.28 Smartphone augmented reality showing real-time depth occlusion effects
(Valentin, Kowdle et al. 2018) © 2018 ACM. Panel titles: Shopping, Navigation, Fun photos.


The topic of registering 3D point datasets is called absolute orientation (Horn 1987) and 
3D pose estimation (Lorusso, Eggert, and Fisher 1995). A variety of techniques has been 
developed for simultaneously computing 3D point correspondences and their corresponding 
rigid transformations (Besl and McKay 1992; Zhang 1994; Szeliski and Lavallée 1996; Gold, 
Rangarajan et al. 1998; David, DeMenthon et al. 2004; Li and Hartley 2007; Enqvist,
Josephson, and Kahl 2009). When only 2D observations are available, a variety of algorithms for
the linear PnP (perspective n-point) problem have been developed (DeMenthon and Davis 1995; Quan
and Lan 1999; Moreno-Noguer, Lepetit, and Fua 2007; Terzakis and Lourakis 2020). More
recent approaches to pose estimation use deep networks (Arandjelovic, Gronat et al. 2016;
Brachmann, Krull et al. 2017; Xiang, Schmidt et al. 2018; Oberweger, Rad, and Lepetit 2018;
Hu, Hugonot et al. 2019; Peng, Liu et al. 2019). Estimating pose from RGB-D images is also
very active (Drost, Ulrich et al. 2010; Brachmann, Michel et al. 2016; Labbé, Carpentier et 
al. 2020). In addition to recognizing object pose for robotics tasks, pose estimation is widely 
used in location recognition (Sattler, Zhou et al. 2019; Revaud, Weinzaepfel et al. 2019; 
Zhou, Sattler et al. 2019; Sarlin, DeTone et al. 2020; Luo, Zhou et al. 2020). 


The topic of structure from motion is extensively covered in books and review articles on
multi-view geometry (Faugeras and Luong 2001; Hartley and Zisserman 2004; Moons, Van
Gool, and Vergauwen 2010) with a survey of more recent developments in Özyeşil, Voroninski
et al. (2017). For two-frame reconstruction, Hartley (1997a) wrote a highly cited paper on 
the “eight-point algorithm” for computing an essential or fundamental matrix with reasonable 
point normalization. When the cameras are calibrated, the five-point algorithm of Nistér 
(2004) can be used in conjunction with RANSAC to obtain initial reconstructions from the 
minimum number of points. When the cameras are uncalibrated, various self-calibration 
techniques can be found in work by Hartley and Zisserman (2004) and Moons, Van Gool, 
and Vergauwen (2010). 
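
The point normalization mentioned above for the eight-point algorithm can be sketched in a few lines. The version below follows the commonly described recipe of translating the points so that their centroid is at the origin and scaling them so that their average distance from the origin is the square root of 2. The function name `normalize_points` and the return format are illustrative assumptions.

```python
import math


def normalize_points(points):
    """Normalize 2D points before estimating an essential or fundamental matrix.

    Returns the normalized points and the 3x3 similarity transform T such
    that (x', y', 1)^T = T (x, y, 1)^T.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_dist = sum(math.hypot(x - cx, y - cy) for x, y in points) / n
    s = math.sqrt(2.0) / mean_dist
    T = [[s, 0.0, -s * cx],
         [0.0, s, -s * cy],
         [0.0, 0.0, 1.0]]
    normalized = [(s * (x - cx), s * (y - cy)) for x, y in points]
    return normalized, T


pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
norm_pts, T = normalize_points(pts)
print(norm_pts)
```

If the two images are normalized with transforms T0 and T1, and the matrix estimated from the normalized points is F', then under the convention x1^T F x0 = 0 the matrix for the original pixel coordinates is F = T1^T F' T0.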


Triggs, McLauchlan et al. (1999) provide a good tutorial and survey on bundle adjust- 
ment, while Lourakis and Argyros (2009) and Engels, Stewénius, and Nistér (2006) provide 
tips on implementation and effective practices. Bundle adjustment is also covered in text- 
books and surveys on multi-view geometry (Faugeras and Luong 2001; Hartley and Zisser- 
man 2004; Moons, Van Gool, and Vergauwen 2010). Techniques for handling larger problems 
are described by Snavely, Seitz, and Szeliski (2008b), Agarwal, Snavely et al. (2009), Agar- 
wal, Snavely et al. (2010), Jeong, Nistér et al. (2012), Wu (2013), Heinly, Schönberger et al. 
(2015), Schönberger and Frahm (2016), and Dellaert and Kaess (2017). While bundle
adjustment is often called as an inner loop inside incremental reconstruction algorithms (Snavely,
Seitz, and Szeliski 2006), hierarchical (Fitzgibbon and Zisserman 1998; Farenzena, Fusiello, 
and Gherardi 2009) and global (Rother and Carlsson 2002; Martinec and Pajdla 2007; Sinha, 
Steedly, and Szeliski 2010; Jiang, Cui, and Tan 2013; Moulon, Monasse, and Marlet 2013; 
Wilson and Snavely 2014; Cui and Tan 2015; Özyeşil and Singer 2015; Holynski, Geraghty 
et al. 2020) approaches for initialization are also possible and perhaps even preferable. 


In the robotics community, techniques for reconstructing a 3D environment from a mov- 
ing robot are called simultaneous localization and mapping (SLAM) (Thrun, Burgard, and 
Fox 2005; Durrant-Whyte and Bailey 2006; Bailey and Durrant-Whyte 2006; Fuentes-Pacheco, 
Ruiz-Ascencio, and Rendón-Mancha 2015; Cadena, Carlone et al. 2016). SLAM differs from 
bundle adjustment in that it allows for a variety of sensing devices and that it solves the lo- 
calization problem online. This makes it the method of choice for both time-critical robotics 
applications such as autonomous navigation (Janai, Güney et al. 2020) and real-time aug- 
mented reality (Valentin, Kowdle et al. 2018). Important papers in this field include (Davison, 
Reid et al. 2007; Klein and Murray 2007, 2009; Newcombe, Lovegrove, and Davison 2011; 
Kaess, Johannsson et al. 2012; Engel, Schöps, and Cremers 2014; Mur-Artal and Tardós
2017; Forster, Zhang et al. 2017; Dellaert and Kaess 2017; Engel, Koltun, and Cremers 2018;
Schöps, Sattler, and Pollefeys 2019a) as well as papers that integrate SLAM with IMUs to
obtain visual inertial odometry (VIO) (Mourikis and Roumeliotis 2007; Li and Mourikis 2013;
Forster, Carlone et al. 2016; Schubert, Goll et al. 2018). 


11.7 Exercises


Ex 11.1: Rotation-based calibration. Take an outdoor or indoor sequence from a rotating 
camera with very little parallax and use it to calibrate the focal length of your camera using 
the techniques described in Section 11.1.3 or Sections 8.2.3-8.3.1. 


1. Take out any radial distortion in the images using one of the techniques from
Exercises 11.5–11.6 or using parameters supplied for a given camera by your instructor.


2. Detect and match feature points across neighboring frames and chain them into feature
tracks.


3. Compute homographies between overlapping frames and use Equations (11.8–11.9) to
get an estimate of the focal length.


4. Compute a full 360° panorama and update your focal length estimate to close the gap
(Section 8.2.4).


5. (Optional) Perform a complete bundle adjustment in the rotation matrices and focal
length to obtain the highest quality estimate (Section 8.3.1).
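
As a possible starting point for step 3, the sketch below recovers a focal length from a rotation-induced homography. It assumes a simplified calibration matrix K = diag(f, f, 1) (image center at the origin, square pixels) and a homography of the form H ~ K1 R K0^-1, so that the first two rows of K1^-1 H K0 must have equal norm and be orthogonal. The resulting formulas may differ in form from Equations (11.8–11.9), and the names `focal_from_homography` and `pan_homography` are illustrative.

```python
import math


def focal_from_homography(H, eps=1e-12):
    """Estimate f0 from a 3x3 rotation-induced homography H ~ K1 R K0^-1.

    Uses two constraints on the first two rows of the implied rotation:
    equal norms, and orthogonality. Returns None if neither is usable.
    """
    (h00, h01, h02), (h10, h11, h12), _ = H
    # Equal-norm constraint on rows 0 and 1.
    den = h00 ** 2 + h01 ** 2 - h10 ** 2 - h11 ** 2
    num = h12 ** 2 - h02 ** 2
    if abs(den) > eps and num / den > 0:
        return math.sqrt(num / den)
    # Orthogonality constraint between rows 0 and 1.
    den = h00 * h10 + h01 * h11
    num = -h02 * h12
    if abs(den) > eps and num / den > 0:
        return math.sqrt(num / den)
    return None


def pan_homography(f, theta):
    """Synthetic test data: H = K R K^-1 for a pan by theta about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, f * s],
            [0.0, 1.0, 0.0],
            [-s / f, 0.0, c]]


H = pan_homography(500.0, 0.3)
print(focal_from_homography(H))   # close to 500 for this noise-free input
```

With real, noisy homographies the two constraints generally give different answers, so estimates from many image pairs are usually combined (for example, with a median) before the refinement in steps 4 and 5.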


Ex 11.2: Target-based calibration. Use a three-dimensional target to calibrate your camera.


1. Construct a three-dimensional calibration pattern with known 3D locations. It is not
easy to get high accuracy unless you use a machine shop, but you can get close using
heavy plywood and printed patterns.


2. Find the corners, e.g., using a line finder and intersecting the lines.


3. Implement one of the iterative calibration and pose estimation algorithms described in
Tsai (1987), Bogart (1991), or Gleicher and Witkin (1992) or the system described in
Section 11.2.2.


4. Take many pictures at different distances and orientations relative to the calibration
target and report on both your re-projection errors and accuracy. (To do the latter, you
may need to use simulated data.)


Ex 11.3: Calibration accuracy. Compare the three calibration techniques (plane-based,
rotation-based, and 3D-target-based).

One approach is to have a different student implement each one and to compare the results.
Another approach is to use synthetic data, potentially re-using the software you developed
for Exercise 2.3. The advantage of using synthetic data is that you know the ground truth
for the calibration and pose parameters, you can easily run lots of experiments, and you can
synthetically vary the noise in your measurements.


Here are some possible guidelines for constructing your test sets:


1. Assume a medium-wide focal length (say, 50° field of view).


2. For the plane-based technique, generate a 2D grid target and project it at different
inclinations.


3. For a 3D target, create an inner cube corner and position it so that it fills most of the
field of view.


4. For the rotation technique, scatter points uniformly on a sphere until you get a similar
number of points as for other techniques.


Before comparing your techniques, predict which one will be the most accurate (normalize
your results by the square root of the number of points used).

Add varying amounts of noise to your measurements and describe the noise sensitivity of
your various techniques.


Ex 11.4: Single view metrology. Implement a system to measure dimensions and recon- 
struct a 3D model from a single image of an architectural scene using visible vanishing direc- 
tions (Section 11.1.2) (Criminisi, Reid, and Zisserman 2000). 


1. Find the three orthogonal vanishing points from parallel lines and use them to establish 
the three coordinate axes (rotation matrix R of the camera relative to the scene). If 
two of the vanishing points are finite (not at infinity), use them to compute the focal 
length, assuming a known image center. Otherwise, find some other way to calibrate 
your camera; you could use some of the techniques described by Schaffalitzky and 
Zisserman (2000). 


2. Click on a ground plane point to establish your origin and click on a point a known
distance away to establish the scene scale. This lets you compute the translation t
between the camera and the scene. As an alternative, click on a pair of points, one
on the ground plane and one above it, and use the known height to establish the scene
scale.


3. Write a user interface that lets you click on ground plane points to recover their 3D
locations. (Hint: you already know the camera matrix, so knowledge of a point's z
value is sufficient to recover its 3D location.) Click on pairs of points (one on the
ground plane, one above it) to measure vertical heights.


4. Extend your system to let you draw quadrilaterals in the scene that correspond to
axis-aligned rectangles in the world, using some of the techniques described by Sinha,
Steedly et al. (2008). Export your 3D rectangles to a VRML or PLY45 file.


5. (Optional) Warp the pixels enclosed by the quadrilateral using the correct homography 
to produce a texture map for each planar polygon. 
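
For the focal length computation in step 1, one standard relation can be sketched as follows. It assumes square pixels, zero skew, a known image center c, and two finite vanishing points v1 and v2 that correspond to orthogonal 3D directions, in which case f^2 = -(v1 - c) · (v2 - c). The function name `focal_from_vanishing_points` is illustrative, and this is only one of several ways to set up the calculation.

```python
import math


def focal_from_vanishing_points(v1, v2, center):
    """Focal length (in pixels) from two orthogonal finite vanishing points.

    Returns None when the configuration is inconsistent with orthogonal
    directions (the dot product below must be negative).
    """
    dot = ((v1[0] - center[0]) * (v2[0] - center[0])
           + (v1[1] - center[1]) * (v2[1] - center[1]))
    if dot >= 0:
        return None
    return math.sqrt(-dot)


# Synthetic check: orthogonal directions (1, 2, 2) and (2, 1, -2) seen by a
# camera with f = 800 and image center (320, 240).
center = (320.0, 240.0)
f_true = 800.0
v1 = (center[0] + f_true * 1 / 2, center[1] + f_true * 2 / 2)
v2 = (center[0] + f_true * 2 / -2, center[1] + f_true * 1 / -2)
print(focal_from_vanishing_points(v1, v2, center))
```

The estimate becomes unstable when one of the vanishing points is very far from the image center, which is why the exercise suggests falling back to another calibration method in that case.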


Ex 11.5: Radial distortion with plumb lines. Implement a plumb-line algorithm to
determine the radial distortion parameters.


1. Take some images of scenes with lots of straight lines, e.g., hallways in your home or
office, and try to get some of the lines as close to the edges of the image as possible.


2. Extract the edges and link them into curves, as described in Section 7.2.2 and
Exercise 7.8.


3. Fit quadratic or elliptic curves to the linked edges using a generalization of the
successive line approximation algorithm described in Section 7.4.1 and Exercise 7.11 and
keep the curves that fit this form well.


4. For each curved segment, fit a straight line and minimize the perpendicular distance
between the curve and the line while adjusting the radial distortion parameters.


5. Alternate between re-fitting the straight line and adjusting the radial distortion
parameters until convergence.


Ex 11.6: Radial distortion with a calibration target. Use a grid calibration target to
determine the radial distortion parameters.


1. Print out a planar calibration target, mount it on a stiff board, and get it to fill your field
of view.


2. Detect the squares, lines, or dots in your calibration target.


3. Estimate the homography mapping the target to the camera from the central portion of
the image that does not have any radial distortion.


4. Predict the positions of the remaining targets and use the differences between the
observed and predicted positions to estimate the radial distortion.


5. (Optional) Fit a general spline model (for severe distortion) instead of the quartic
distortion model.


6. (Optional) Extend your technique to calibrate a fisheye lens.


45https://meshlab.net.


746 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021)


Ex 11.7: Chromatic aberration. Use the radial distortion estimates for each color channel 
computed in the previous exercise to clean up wide-angle lens images by warping all of the 
channels into alignment. (Optional) Straighten out the images at the same time. 


Can you think of any reasons why this warping strategy may not always work? 


Ex 11.8: Triangulation. Use the calibration pattern you built and tested in Exercise 11.2 to 
test your triangulation accuracy. As an alternative, generate synthetic 3D points and cameras 
and add noise to the 2D point measurements. 


1. Assume that you know the camera pose, i.e., the camera matrices. Use the 3D distance
to rays (11.24) or linearized versions of Equations (11.25–11.26) to compute an initial
set of 3D locations. Compare these to your known ground truth locations.


2. Use iterative non-linear minimization to improve your initial estimates and report on
the improvement in accuracy.


3. (Optional) Use the technique described by Hartley and Sturm (1997) to perform
two-frame triangulation.


4. See if any of the failure modes reported by Hartley and Sturm (1997) or Hartley (1998)
occur in practice.
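
For step 1, the "distance to rays" approach can be sketched as a small linear least squares problem: find the 3D point p that minimizes the sum of squared distances to a set of rays with centers c_j and unit directions v_j, which gives [sum_j (I - v_j v_j^T)] p = sum_j (I - v_j v_j^T) c_j. This is the usual closed form for that objective and may differ in notation from Equation (11.24). The helper names `det3`, `solve3`, and `triangulate_rays` are illustrative.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, b):
    """Solve the 3x3 system A x = b with Cramer's rule."""
    d = det3(A)
    x = []
    for k in range(3):
        Ak = [[b[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        x.append(det3(Ak) / d)
    return x


def triangulate_rays(centers, directions):
    """3D point closest (in least squares) to rays c_j + d * v_j."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, v in zip(centers, directions):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - v[i] * v[j]   # (I - v v^T)_ij
                A[i][j] += m
                b[i] += m * c[j]
    return solve3(A, b)


# Two cameras looking at the point (1, 2, 5); the rays are noise-free here.
point = (1.0, 2.0, 5.0)
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
directions = [tuple(p - c for p, c in zip(point, ctr)) for ctr in centers]
estimate = triangulate_rays(centers, directions)
print(estimate)
```

The system becomes ill-conditioned when the rays are nearly parallel, which is one of the situations worth testing when you add noise to the 2D measurements.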


Ex 11.9: Essential and fundamental matrix. Implement the two-frame E and F matrix
estimation techniques presented in Section 11.3, with suitable re-scaling for better noise
immunity.


1. Use the data from Exercise 11.8 to validate your algorithms and to report on their 
accuracy. 


2. (Optional) Implement one of the improved F or E estimation algorithms, e.g., us- 
ing renormalization (Zhang 1998b; Torr and Fitzgibbon 2004; Hartley and Zisserman 
2004), RANSAC (Torr and Murray 1997), least median of squares (LMS), or the five- 
point algorithm developed by Nistér (2004). 


Ex 11.10: View morphing and interpolation. Implement automatic view morphing, i.e.,
compute two-frame structure from motion and then use these results to generate a smooth
animation from one image to the next (Section 11.3.5).


1. Decide how to represent your 3D scene, e.g., compute a Delaunay triangulation of the
matched points and decide what to do with the triangles near the border. (Hint: try fitting
a plane to the scene, e.g., behind most of the points.)


2. Compute your in-between camera positions and orientations.


3. Warp each triangle to its new location, preferably using the correct perspective
projection (Szeliski and Shum 1997).


4. (Optional) If you have a denser 3D model (e.g., from stereo), decide what to do at the
“cracks”.


5. (Optional) For a non-rigid scene, e.g., two pictures of a face with different expressions,
not all of your matched points will obey the epipolar geometry. Decide how to handle
them to achieve the best effect.


Ex 11.11: Bundle adjuster. Implement a full bundle adjuster. This may sound daunting,
but it really is not.


1. Devise the internal data structures and external file representations to hold your camera
parameters (position, orientation, and focal length), 3D point locations (Euclidean or
homogeneous), and 2D point tracks (frame and point identifier as well as 2D locations).


2. Use some other technique, such as factorization, to initialize the 3D point and camera
locations from your 2D tracks (e.g., a subset of points that appears in all frames).


3. Implement the code corresponding to the forward transformations in Figure 11.14, i.e.,
for each 2D point measurement, take the corresponding 3D point, map it through the
camera transformations (including perspective projection and focal length scaling), and
compare it to the 2D point measurement to get a residual error.


4. Take the residual error and compute its derivatives with respect to all the unknown
motion and structure parameters, using backward chaining, as shown, e.g., in Figure 11.14
and Equation (11.19). This gives you the sparse Jacobian J used in Equations (8.13–8.17)
and Equation (11.15).


5. Use a sparse least squares or linear system solver, e.g., MATLAB, SuiteSparse, or
SPARSKIT (see Appendix A.4 and A.5), to solve the corresponding linearized system,
adding a small amount of diagonal preconditioning, as in Levenberg–Marquardt.


6. Update your parameters, make sure your rotation matrices are still orthonormal (e.g.,
by re-computing them from your quaternions), and continue iterating while monitoring
your residual error.


7. (Optional) Use the “Schur complement trick” (11.68) to reduce the size of the system
being solved (Triggs, McLauchlan et al. 1999; Hartley and Zisserman 2004; Lourakis
and Argyros 2009; Engels, Stewénius, and Nistér 2006).


8. (Optional) Implement your own iterative sparse solver, e.g., conjugate gradient, and
compare its performance to a direct method.


9. (Optional) Make your bundle adjuster robust to outliers, or try adding some of the other
improvements discussed in Engels, Stewénius, and Nistér (2006). Can you think of any
other ways to make your algorithm even faster or more robust?
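
The forward transformation and residual in step 3 can be prototyped with a very small camera model before adding the full chain from Figure 11.14. The sketch below assumes a rotation matrix R, a translation t, and a single focal length f, with no radial distortion and the image center at the origin. The function names `project` and `reprojection_residual`, and the sign convention (predicted minus observed), are illustrative choices rather than the book's definitions.

```python
def project(point, R, t, f):
    """Project a 3D world point into the image of a simple pinhole camera."""
    # Rigid transformation into camera coordinates: x_c = R p + t.
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective division followed by focal length scaling.
    return (f * xc[0] / xc[2], f * xc[1] / xc[2])


def reprojection_residual(point, R, t, f, observed):
    """2D residual between the predicted and the measured image location."""
    u, v = project(point, R, t, f)
    return (u - observed[0], v - observed[1])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = (1.0, 2.0, 4.0)
residual = reprojection_residual(p, identity, (0.0, 0.0, 0.0), 100.0, (24.0, 52.0))
print(residual)
```

Stacking these residuals over all 2D measurements gives the error vector whose derivatives form the sparse Jacobian in step 4.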


Ex 11.12: Match move and augmented reality. Use the results of the previous exercise to
superimpose a rendered 3D model on top of video. See Section 11.4.4 for more details and
ideas. Check how “locked down” the objects are.


Ex 11.13: Line-based reconstruction. Augment the previously developed bundle adjuster 
to include lines, possibly with known 3D orientations. 

Optionally, use co-planar sets of points and lines to hypothesize planes and to enforce 
co-planarity (Schaffalitzky and Zisserman 2002; Robertson and Cipolla 2002). 


Ex 11.14: Flexible bundle adjuster. Design a bundle adjuster that allows for arbitrary chains 
of transformations and prior knowledge about the unknowns, as suggested in
Figures 11.14–11.15.


Ex 11.15: Unordered image matching. Compute the camera pose and 3D structure of a 
scene from an arbitrary collection of photographs (Brown and Lowe 2005; Snavely, Seitz, 
and Szeliski 2006). 


Ex 11.16: Augmented reality toolkits. Write a simple mobile AR app based on one of the 
widely used augmented reality frameworks such as ARKit or ARCore. What fun effects can 
you create? What are the conditions that make your AR system lose track? Can you move a
large distance and come back to your original location without too much drift?


Figure 12.1 Depth estimation algorithms can convert a pair of color images (a–b) into a
depth map (c) (Scharstein, Hirschmüller et al. 2014) © 2014 Springer, a sequence of images
(d) into a 3D model (e) (Knapitsch, Park et al. 2017) © 2017 ACM, or a single image (f) into
a depth map (g) (Li, Dekel et al. 2019) © 2019 IEEE.


Stereo matching is the process of taking two or more images and building a 3D model of
the scene by finding matching pixels in the images and converting their 2D positions into 3D 
depths. In Chapter 11, we described techniques for recovering camera positions and building 
sparse 3D models of scenes or objects. In this chapter, we address the question of how to build 
a more complete 3D model, e.g., a sparse or dense depth map that assigns relative depths to 
pixels in the input images. We also look at the topic of multi-view stereo algorithms that 
produce complete 3D volumetric or surface-based object models, as well as monocular depth 
recovery algorithms that infer plausible depths from just a single image. 

Why are people interested in depth estimation and stereo matching? From the earliest
inquiries into visual perception, it was known that we perceive depth based on the differences
in appearance between the left and right eye.¹ As a simple experiment, hold your finger
vertically in front of your eyes and close each eye alternately. You will notice that the finger
jumps left and right relative to the background of the scene. The same phenomenon is visible
in the image pair shown in Figure 12.1a–b, in which the foreground objects shift left and right
relative to the background.

As we will shortly see, under simple imaging configurations (both eyes or cameras look- 
ing straight ahead), the amount of horizontal motion or disparity is inversely proportional to 
the distance from the observer. While the basic physics and geometry relating visual disparity 
to scene structure are well understood (Section 12.1), automatically measuring this disparity 
by establishing dense and accurate inter-image correspondences is a challenging task. 

The earliest stereo matching algorithms were developed in the field of photogrammetry 
for automatically constructing topographic elevation maps from overlapping aerial images. 
Prior to this, operators would use photogrammetric stereo plotters, which displayed shifted 
versions of such images to each eye and allowed the operator to float a dot cursor around 
constant elevation contours. The development of fully automated stereo matching algorithms 
was a major advance in this field, enabling much more rapid and less expensive processing of 
aerial imagery (Hannah 1974; Hsieh, McKeown, and Perlant 1992). 

In computer vision, the topic of stereo matching has been one of the most widely studied 
and fundamental problems (Marr and Poggio 1976; Barnard and Fischler 1982; Dhond and 
Aggarwal 1989; Scharstein and Szeliski 2002; Brown, Burschka, and Hager 2003; Seitz, Cur- 
less et al. 2006), and continues to be one of the most active research areas (Poggi, Tosi et al. 
2021). While photogrammetric matching concentrated mainly on aerial imagery, computer 
vision applications include modeling the human visual system (Marr 1982), robotic naviga- 
tion and manipulation (Moravec 1983; Konolige 1997; Thrun, Montemerlo et al. 2006; Janai, 


¹The word stereo comes from the Greek for solid; stereo vision is how we perceive solid shape (Koenderink
1990).


Figure 12.2 Applications of stereo vision: (a) input image, (b) computed depth map, and
(c) new view generation from multi-view stereo (Matthies, Kanade, and Szeliski 1989) © 1989
Springer; (d) view morphing between two images (Seitz and Dyer 1996) © 1996 ACM; (e–f)
3D face modeling (images courtesy of Frédéric Devernay); (g) z-keying live and
computer-generated imagery (Kanade, Yoshida et al. 1996) © 1996 IEEE; (h–i) building 3D surface
models from multiple video streams in Virtualized Reality (Kanade, Rander, and Narayanan
1997) © 1997 IEEE; (j) computing depth maps for autonomous navigation (Geiger, Lenz,
and Urtasun 2012) © 2012 IEEE.
Güney et al. 2020) and Figures 12.2j and 11.26, as well as view interpolation and image-based
rendering (Figure 12.2a–d), 3D model building (Figure 12.2e–f and h–i), mixing live action
with computer-generated imagery (Figure 12.2g), and augmented reality (Valentin, Kowdle
et al. 2018; Chaurasia, Nieuwoudt et al. 2020) and Figure 11.28.

In this chapter, we describe the fundamental principles behind stereo matching, following 
the general taxonomy proposed by Scharstein and Szeliski (2002). We begin in Section 12.1 
with a review of the geometry of stereo image matching, i.e., how to compute for a given 
pixel in one image the range of possible locations the pixel might appear at in the other 
image, i.e., its epipolar line. We describe how to pre-warp images so that corresponding 
epipolar lines are coincident (rectification). We also describe a general resampling algorithm 
called plane sweep that can be used to perform multi-image stereo matching with arbitrary 
camera configurations. 

Next, we briefly survey techniques for the sparse stereo matching of interest points and 
edge-like features (Section 12.2). We then turn to the main topic of this chapter, namely 
the estimation of a dense set of pixel-wise correspondences in the form of a disparity map 
(Figure 12.1c). This involves first selecting a pixel matching criterion (Section 12.3) and then 
using either local area-based aggregation (Section 12.4), global optimization (Section 12.5), 
or deep networks (Section 12.6), to help disambiguate potential matches. In Section 12.7, we 
discuss multi-view stereo algorithms, which use more than two images in order to produce higher-quality
depth maps or complete 3D object or scene models (Figure 12.1d—e). Finally, in Section 12.8 
we present algorithms for inferring depth from just a single image, which has now become 
possible using machine learning and deep networks. 

Throughout this chapter, we will often refer to datasets and benchmarks that have been 
used to develop depth inference algorithms and gauge their performance. Of these, the most 
widely used and influential include the Middlebury stereo and multi-view datasets and
benchmarks, which were among the first to keep up-to-date leaderboards, the EPFL multi-view
dataset, the KITTI benchmarks for autonomous driving (stereo, flow, scene flow, and others), 
the DTU dataset, ETH3D benchmark, Tanks and Temples benchmark, and BlendedMVS 
dataset, which are all summarized in Table 12.1. Pointers to additional datasets can be found 
in Mayer, Ilg et al. (2018), Janai, Güney et al. (2020), Laga, Jospin et al. (2020), and Poggi,
Tosi et al. (2021). 
Given a pixel in one image, how can we compute its correspondence in the other image? In
Chapter 9, we saw that a variety of search techniques can be used to match pixels based on
Name/URL                                                    Contents/Reference

Middlebury stereo                                           33 high-resolution stereo pairs
https://vision.middlebury.edu/stereo                        (Scharstein, Hirschmüller et al. 2014)

Middlebury multi-view                                       6 3D objects scanned from 300+ views
https://vision.middlebury.edu/mview                         (Seitz, Curless et al. 2006)

EPFL                                                        6 outdoor multi-view sets of images
(no longer active)                                          (Strecha, von Hansen et al. 2008)

KITTI 2015                                                  200 train + 200 test stereo pairs
http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php   (Menze and Geiger 2015)

DTU                                                         124 toy scenes with 49-64 images each
https://roboimagedata.compute.dtu.dk/?page_id=36            (Jensen, Dahl et al. 2014)

Freiburg Scene Flow                                         39k synthetic stereo pairs
https://lmb.informatik.uni-freiburg.de/resources/datasets   (Mayer, Ilg et al. 2018)

ETH3D                                                       13 training + 12 test high-res scenes
https://www.eth3d.net                                       (Schöps, Schönberger et al. 2017)

Tanks and Temples                                           7 training + 14 test 4K video scenes
https://www.tanksandtemples.org                             (Knapitsch, Park et al. 2017)

BlendedMVS                                                  17k MVS images covering 113 scenes
https://github.com/YoYo000/BlendedMVS                       (Yao, Luo et al. 2020)

Table 12.1 Widely used stereo datasets and benchmarks.


their local appearance as well as the motions of neighboring pixels. In the case of stereo
matching, however, we have some additional information available, namely the positions and
calibration data for the cameras that took the pictures of the same static scene (Section 11.3).


How can we exploit this information to reduce the number of potential correspondences,
and hence both speed up the matching and increase its reliability? Figure 12.3a shows how a
pixel in one image $x_0$ projects to an epipolar line segment in the other image. The segment
is bounded at one end by the projection of the original viewing ray at infinity $p_\infty$ and at the
other end by the projection of the original camera center $c_0$ into the second camera, which
is known as the epipole $e_1$. If we project the epipolar line in the second image back into the
first, we get another line (segment), this time bounded by the other corresponding epipole
$e_0$. Extending both line segments to infinity, we get a pair of corresponding epipolar lines
(Figure 12.3b), which are the intersection of the two image planes with the epipolar plane
that passes through both camera centers $c_0$ and $c_1$ as well as the point of interest $p$ (Faugeras
and Luong 2001; Hartley and Zisserman 2004).
Figure 12.3 Epipolar geometry: (a) epipolar line segment corresponding to one ray; (b)
corresponding set of epipolar lines and their epipolar plane.


12.1.1 Rectification 


As we saw in Section 11.3, the epipolar geometry for a pair of cameras is implicit in the 
relative pose and calibrations of the cameras, and can easily be computed from seven or more 
point matches using the fundamental matrix (or five or more points for the calibrated essential 
matrix) (Zhang 1998a,b; Faugeras and Luong 2001; Hartley and Zisserman 2004). Once this 
geometry has been computed, we can use the epipolar line corresponding to a pixel in one 
image to constrain the search for corresponding pixels in the other image. One way to do this 
is to use a general correspondence algorithm, such as optical flow (Section 9.3), but to only
consider locations along the epipolar line (or to project any flow vectors that fall off back onto 
the line). 
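
As a rough illustration of this constraint, the following NumPy sketch builds a fundamental matrix from a known relative pose and checks that the projection of a 3D point into the second image lies on the epipolar line of its projection in the first. The intrinsics `K`, the pose `(R, t)`, the test point, and all function names are hypothetical values chosen for the example, and the pose convention $X_1 = R X_0 + t$ is an assumption.

```python
import numpy as np

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K0, K1, R, t):
    """F such that x1^T F x0 = 0, assuming X1 = R X0 + t."""
    E = skew(t) @ R                                   # essential matrix
    return np.linalg.inv(K1).T @ E @ np.linalg.inv(K0)

def epipolar_line(F, x0):
    """Line (a, b, c) in image 1 with a*x + b*y + c = 0, scaled so |(a, b)| = 1."""
    l = F @ np.array([x0[0], x0[1], 1.0])
    return l / np.hypot(l[0], l[1])

def point_line_distance(l, x):
    """Distance in pixels from point x to the normalized line l."""
    return abs(l @ np.array([x[0], x[1], 1.0]))

# Hypothetical setup: identical cameras, camera 1 shifted along the x axis.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.1, 0.0, 0.0])
F = fundamental_from_pose(K, K, R, t)

X0 = np.array([0.3, -0.2, 2.0])          # 3D point in camera-0 coordinates
x0 = (K @ X0)[:2] / X0[2]                # its pixel in image 0
X1 = R @ X0 + t
x1 = (K @ X1)[:2] / X1[2]                # its pixel in image 1

line = epipolar_line(F, x0)
residual = point_line_distance(line, x1)  # should be numerically zero
```

With a purely horizontal translation and no rotation, the epipolar line comes out horizontal, so the search for the match reduces to a 1D search along a scanline.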

A more efficient algorithm can be obtained by first rectifying (i.e., warping) the input
images so that corresponding horizontal scanlines are epipolar lines (Loop and Zhang 1999;
Faugeras and Luong 2001; Hartley and Zisserman 2004).² Afterwards, it is possible to match
horizontal scanlines independently or to shift images horizontally while computing matching 
scores (Figure 12.4). 

A simple way to rectify the two images is to first rotate both cameras so that they are
looking perpendicular to the line joining the camera centers $c_0$ and $c_1$. As there is a
degree of freedom in the tilt, the smallest rotations that achieve this should be used. Next, to


²This makes most sense if the cameras are next to each other, although by rotating the cameras, rectification can
be performed on any pair that is not verged too much or has too much of a scale change. In those latter cases, using 
plane sweep (below) or hypothesizing small planar patch locations in 3D (Goesele, Snavely et al. 2007) may be 
preferable. 


756 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 



Figure 12.4 The multi-stage stereo rectification algorithm of Loop and Zhang (1999) ©
1999 IEEE. (a) Original image pair overlaid with several epipolar lines; (b) images transformed
so that epipolar lines are parallel; (c) images rectified so that epipolar lines are
horizontal and in vertical correspondence; (d) final rectification that minimizes horizontal
distortions.


determine the desired twist around the optical axes, make the up vector (the camera y-axis) 
perpendicular to the camera center line. This ensures that corresponding epipolar lines are 
horizontal and that the disparity for points at infinity is 0. Finally, re-scale the images, if
necessary, to account for different focal lengths, magnifying the smaller image to avoid alias- 
ing. (The full details of this procedure can be found in Fusiello, Trucco, and Verri (2000) 
and Exercise 12.1.) When additional information about the imaging process is available, e.g., 
that the images were formed on co-planar photographic plates, more specialized and accurate 
algorithms can be developed (Luo, Kong et al. 2020). Note that in general, it is not possi- 
ble to rectify an arbitrary collection of images simultaneously unless their optical centers are 
collinear, although rotating the cameras so that they all point in the same direction reduces 
the inter-camera pixel movements to scalings and translations. 
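
The rotation-based procedure can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact procedure of any of the cited papers: it resolves the tilt ambiguity by using the old optical axis of the first camera (one common choice, not necessarily the smallest rotation), uses the same intrinsics `K` for both rectified cameras, and all names and numerical values are hypothetical.

```python
import numpy as np

def rectifying_rotation(c0, c1, z_old):
    """World-to-camera rotation whose rows are the new camera axes."""
    x = (c1 - c0) / np.linalg.norm(c1 - c0)   # new x axis along the baseline
    y = np.cross(z_old, x)                     # perpendicular to baseline and old optical axis
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)                         # new optical axis, perpendicular to the baseline
    return np.stack([x, y, z])

def rectifying_homography(K_new, R_new, K_old, R_old):
    """Pixel warp induced by rotating a camera about its center."""
    return K_new @ R_new @ R_old.T @ np.linalg.inv(K_old)

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(K, R, c, X):
    """Homogeneous pixel (scaled so the last entry is 1) of world point X."""
    x = K @ R @ (X - c)
    return x / x[2]

# Hypothetical, slightly verged stereo rig.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
c0, c1 = np.array([0.0, 0.0, 0.0]), np.array([0.2, 0.02, 0.01])
R0, R1 = rot_y(0.05), rot_y(-0.05)

R_new = rectifying_rotation(c0, c1, z_old=R0[2])   # R0[2] is camera 0's optical axis
H0 = rectifying_homography(K, R_new, K, R0)
H1 = rectifying_homography(K, R_new, K, R1)

X = np.array([0.3, -0.1, 3.0])                      # test point in world coordinates
r0 = H0 @ project(K, R0, c0, X); r0 = r0 / r0[2]
r1 = H1 @ project(K, R1, c1, X); r1 = r1 / r1[2]
# After rectification the two projections share the same row (r0[1] == r1[1]).
```

In practice the homographies `H0` and `H1` would be used to resample the two images; the check on a single point here only illustrates that corresponding points end up on the same scanline.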

The resulting standard rectified geometry is employed in a lot of stereo camera setups and 
stereo algorithms, and leads to a very simple inverse relationship between 3D depths Z and 
disparities d, 

$$d = f \frac{B}{Z}, \tag{12.1}$$

where $f$ is the focal length (measured in pixels), $B$ is the baseline, and

$$x' = x + d(x, y), \quad y' = y \tag{12.2}$$
Figure 12.5 Slices through a typical disparity space image (DSI) (Scharstein and Szeliski
2002) © 2002 Springer: (a) original color image; (b) ground truth disparities; (c–e) three
(x, y) slices for d = 10, 16, 21; (f) an (x, d) slice for y = 151 (the dashed line in (b)). Various
dark (matching) regions are visible in (c–e), e.g., the bookshelves, table and cans, and head
statue, and three disparity levels can be seen as horizontal lines in (f). The dark bands in the
DSIs indicate regions that match at this disparity. (Smaller dark regions are often the result of
textureless regions.) Additional examples of DSIs are discussed by Bobick and Intille (1999).


describes the relationship between corresponding pixel coordinates in the left and right images
(Bolles, Baker, and Marimont 1987; Okutomi and Kanade 1993; Scharstein and Szeliski
2002).³ The task of extracting depth from a set of images then becomes one of estimating the
disparity map d(x,y).
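
The inverse relationship $d = f B / Z$ between disparity and depth is easy to apply once a disparity map is available. The sketch below is a minimal illustration; the function name and the numbers (a 500-pixel focal length and a 0.1 m baseline) are hypothetical, and non-positive disparities are mapped to infinite depth by assumption.

```python
import numpy as np

def depth_from_disparity(d, f, B):
    """Convert disparities (pixels) to depths Z = f * B / d.

    f is the focal length in pixels and B the baseline; Z has the units of B.
    Pixels with d <= 0 are treated as infinitely far away.
    """
    d = np.asarray(d, dtype=float)
    Z = np.full(d.shape, np.inf)
    np.divide(f * B, d, out=Z, where=d > 0)
    return Z

depths = depth_from_disparity([25.0, 50.0, 0.0], f=500.0, B=0.1)
```

Because depth varies as 1/d, a fixed disparity error (say, one pixel) produces a depth error that grows roughly quadratically with distance.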

After rectification, we can easily compare the similarity of pixels at corresponding
locations (x, y) and (x′, y′) = (x + d, y) and store them in a disparity space image (DSI)
C(x, y, d) for further processing (Figure 12.5). The concept of the disparity space (x, y, d) 
dates back to early work in stereo matching (Marr and Poggio 1976), while the concept of a 
disparity space image (volume) is generally associated with Yang, Yuille, and Lu (1993) and 
Intille and Bobick (1994). 
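
A minimal sketch of filling a DSI $C(x, y, d)$ for a rectified pair is shown below. It uses a per-pixel squared difference as the (assumed) similarity measure and follows the convention $x' = x + d$; the array layout `C[d, y, x]`, the function name, and the use of infinity for out-of-image comparisons are choices made for this example only.

```python
import numpy as np

def disparity_space_image(ref, other, disparities):
    """C[i, y, x] = (ref[y, x] - other[y, x + d_i])**2, or inf if x + d_i is outside the image."""
    H, W = ref.shape
    C = np.full((len(disparities), H, W), np.inf)
    for i, d in enumerate(disparities):
        lo, hi = max(0, -d), min(W, W - d)        # columns x with 0 <= x + d < W
        if lo < hi:
            C[i, :, lo:hi] = (ref[:, lo:hi] - other[:, lo + d:hi + d]) ** 2
    return C

# Synthetic test: the reference image is the other image shifted by 3 pixels.
rng = np.random.default_rng(0)
other = rng.random((8, 32))
ref = np.roll(other, -3, axis=1)                  # ref[y, x] = other[y, x + 3] away from the border
C = disparity_space_image(ref, other, range(0, 6))
winner = np.argmin(C, axis=0)                     # per-pixel best disparity index
```

On real images the raw per-pixel costs are far too ambiguous to be used directly, which is why the aggregation and optimization stages discussed later in the chapter operate on this volume.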


12.1.2 Plane sweep 


An alternative to pre-rectifying the images before matching is to sweep a set of planes through 
the scene and to measure the photoconsistency of different images as they are re-projected 
onto these planes (Figure 12.6). This process is commonly known as the plane sweep algo- 
rithm (Collins 1996; Szeliski and Golland 1999; Saito and Kanade 1999). 


³The term disparity was first introduced in the human vision literature to describe the difference in location of
corresponding features seen by the left and right eyes (Marr 1982). Horizontal disparity is the most commonly


Figure 12.6 Sweeping a set of planes through a scene (Szeliski and Golland 1999) © 1999 
Springer: (a) The set of planes seen from a virtual camera induces a set of homographies in 
any other source (input) camera image. (b) The warped images from all the other cameras can 
be stacked into a generalized disparity space volume $\tilde{I}(x, y, d, k)$ indexed by pixel location
(x, y), disparity d, and camera k. 


As we saw in Section 2.1.4, where we introduced projective depth (also known as plane
plus parallax (Kumar, Anandan, and Hanna 1994; Sawhney 1994; Szeliski and Coughlan
1997)), the last row of a full-rank 4 × 4 projection matrix $\tilde{P}$ can be set to an arbitrary plane
equation $p_3 = s_3[\hat{n}_0 | c_0]$. The resulting four-dimensional projective transform (collineation)
(2.68) maps 3D world points $p = (X, Y, Z, 1)$ into screen coordinates $x_s = (x_s, y_s, 1, d)$,
where the projective depth (or parallax) $d$ (2.66) is 0 on the reference plane (Figure 2.11).

Sweeping $d$ through a series of disparity hypotheses, as shown in Figure 12.6a, corresponds
to mapping each input image into the virtual camera $\tilde{P}$ defining the disparity space
through a series of homographies (2.68–2.71),

$$\tilde{x}_k \sim \tilde{P}_k \tilde{P}^{-1} x_s = \tilde{H}_k \tilde{x} + t_k d = (\tilde{H}_k + t_k [0\ 0\ d]) \tilde{x}, \tag{12.3}$$

as shown in Figure 2.12b, where $\tilde{x}_k$ and $\tilde{x}$ are the homogeneous pixel coordinates in the
source and virtual (reference) images (Szeliski and Golland 1999). The members of the
family of homographies $\tilde{H}_k(d) = \tilde{H}_k + t_k [0\ 0\ d]$, which are parameterized by the addition of
a rank-1 matrix, are related to each other through a planar homology (Hartley and Zisserman
2004, A5.2).
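
The family of sweep homographies can be generated directly from the two 4 × 4 camera matrices. The sketch below assumes both matrices are full rank and given in the same world frame; the random matrices stand in for real calibrated cameras, and the function and variable names are hypothetical.

```python
import numpy as np

def sweep_homographies(P_ref, P_k, disparities):
    """Return the 3x3 homographies H_k(d) = H_k + t_k [0 0 d] for each disparity d."""
    M = P_k @ np.linalg.inv(P_ref)        # 4x4 collineation from the virtual camera to camera k
    H_k, t_k = M[:3, :3], M[:3, 3]        # split into a homography and a parallax vector
    e3 = np.array([0.0, 0.0, 1.0])
    return [H_k + d * np.outer(t_k, e3) for d in disparities]

# Placeholder full-rank 4x4 camera matrices (not real calibrations).
rng = np.random.default_rng(1)
P_ref = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)
P_k = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)

ds = [0.0, 0.5, 1.0]
Hs = sweep_homographies(P_ref, P_k, ds)

# Mapping the 4-vector (x, y, 1, d) with the full collineation agrees with
# applying H_k(d) to the 3-vector (x, y, 1).
M = P_k @ np.linalg.inv(P_ref)
lhs = [(M @ np.array([10.0, 20.0, 1.0, d]))[:3] for d in ds]
rhs = [H @ np.array([10.0, 20.0, 1.0]) for H in Hs]
```

Each `Hs[i]` can then be used to warp input image k onto the plane with disparity `ds[i]`, with any two members of the family differing by a rank-1 matrix.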

The choice of virtual camera and parameterization is application dependent and is what
gives this framework a lot of its flexibility. In many applications, one of the input cameras (the


studied phenomenon, but vertical disparity is possible if the eyes are verged. 
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reference camera) is used, thus computing a depth map that is registered with one of the input 
images and which can later be used for image-based rendering (Sections 14.1 and 14.2). In 
other applications, such as view interpolation for gaze correction in video-conferencing (Sec- 
tion 12.4.2) (Ott, Lewis, and Cox 1993; Criminisi, Shotton et al. 2003), a camera centrally 
located between the two input cameras is preferable, because it provides the needed per-pixel 
disparities to hallucinate the virtual middle image. 

The choice of disparity sampling, i.e., the setting of the zero parallax plane and the scaling 
of integer disparities, is also application dependent, and is usually set to bracket the range of 
interest, i.e., the working volume, while scaling disparities to sample the image in pixel (or 
sub-pixel) shifts. For example, when using stereo vision for obstacle avoidance in robot 
navigation, it is most convenient to set up disparity to measure per-pixel elevation above the 
ground (Ivanchenko, Shen, and Coughlan 2009). 

As each input image is warped onto the current planes parameterized by disparity $d$, it
can be stacked into a generalized disparity space image $\tilde{I}(x, y, d, k)$ for further processing
(Figure 12.6b) (Szeliski and Golland 1999). In most stereo algorithms, the photoconsistency
(e.g., sum of squared or robust differences) with respect to the reference image $I_r$ is calculated
and stored in the DSI

$$C(x, y, d) = \sum_k \rho(\tilde{I}(x, y, d, k) - I_r(x, y)). \tag{12.4}$$

However, it is also possible to compute alternative statistics such as robust variance, focus, or 
entropy (Section 12.3.1) (Vaish, Szeliski et al. 2006) or to use this representation to reason 
about occlusions (Szeliski and Golland 1999; Kang and Szeliski 2004). The generalized DSI 
will come in particularly handy when we come back to the topic of multi-view stereo in 
Section 12.7.2. 
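
Given the stack of warped images, the photoconsistency sum over cameras is a one-line reduction. The sketch below assumes the warped images have already been resampled into an array indexed as `I_warp[k, d, y, x]` and uses a plain squared difference for $\rho$ by default; the array layout, names, and random test data are assumptions for illustration.

```python
import numpy as np

def photoconsistency_dsi(I_warp, I_ref, rho=np.square):
    """C[d, y, x] = sum over k of rho(I_warp[k, d, y, x] - I_ref[y, x])."""
    return rho(I_warp - I_ref[None, None, :, :]).sum(axis=0)

rng = np.random.default_rng(0)
I_ref = rng.random((4, 5))                 # reference image
I_warp = rng.random((3, 6, 4, 5))          # 3 cameras, 6 disparity hypotheses
I_warp[:, 2] = I_ref                       # pretend all cameras agree at hypothesis index 2
C = photoconsistency_dsi(I_warp, I_ref)
best = np.argmin(C, axis=0)                # lowest-cost hypothesis per pixel
```

Passing a different `rho` (for example, a truncated quadratic) turns the same reduction into a robust photoconsistency measure.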

Of course, planes are not the only surfaces that can be used to define a 3D sweep through 
the space of interest. Cylindrical surfaces, especially when coupled with panoramic photog- 
raphy (Section 8.2), are often used (Ishiguro, Yamamoto, and Tsuji 1992; Kang and Szeliski 
1997; Shum and Szeliski 1999; Li, Shum et al. 2004; Zheng, Kang et al. 2007). It is also 
possible to define other manifold topologies, e.g., ones where the camera rotates around a 
fixed axis (Seitz 2001). 

Once the DSI has been computed, the next step in most stereo correspondence algorithms 
is to produce a univalued function in disparity space d(x, y) that best describes the shape of 
the surfaces in the scene. This can be viewed as finding a surface embedded in the disparity 
space image that has some optimality property, such as lowest cost and best (piecewise) 
smoothness (Yang, Yuille, and Lu 1993). Figure 12.5 shows examples of slices through a 
typical DSI. More figures of this kind can be found in the paper by Bobick and Intille (1999). 
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12.2 Sparse correspondence 


Early stereo matching algorithms were feature-based, i.e., they first extracted a set of poten- 
tially matchable image locations, using either interest operators or edge detectors, and then 
searched for corresponding locations in other images using a patch-based metric (Hannah 
1974; Marr and Poggio 1979; Mayhew and Frisby 1980; Baker and Binford 1981; Arnold 
1983; Grimson 1985; Ohta and Kanade 1985; Bolles, Baker, and Marimont 1987; Matthies, 
Kanade, and Szeliski 1989; Hsieh, McKeown, and Perlant 1992; Bolles, Baker, and Hannah 
1993). This limitation to sparse correspondences was partially due to computational resource 
limitations, but was also driven by a desire to limit the answers produced by stereo algorithms 
to matches with high certainty. In some applications, there was also a desire to match scenes 
with potentially very different illuminations, where edges might be the only stable features 
(Collins 1996). Such sparse 3D reconstructions could later be interpolated using surface fit- 
ting algorithms such as those discussed in Sections 4.2 and 13.3.1. 

More recent work in this area has focused on first extracting highly reliable features and 
then using these as seeds to grow additional matches (Zhang and Shan 2000; Lhuillier and 
Quan 2002; Čech and Šára 2007) or as inputs to a dense per-pixel depth solver (Valentin,
Kowdle et al. 2018). Similar approaches have also been extended to wide baseline multi- 
view stereo problems and combined with 3D surface reconstruction (Lhuillier and Quan 2005; 
Strecha, Tuytelaars, and Van Gool 2003; Goesele, Snavely et al. 2007) or free-space reasoning 
(Taylor 2003), as described in more detail in Section 12.7. 


12.2.1 3D curves and profiles 


Another example of sparse correspondence is the matching of profile curves (or occluding 
contours), which occur at the boundaries of objects (Figure 12.7) and at interior self occlusions,
where the surface curves away from the camera viewpoint.

The difficulty in matching profile curves is that in general, the locations of profile curves 
vary as a function of camera viewpoint. Therefore, matching curves directly in two images 
and then triangulating these matches can lead to erroneous shape measurements. Fortunately, 
if three or more closely spaced frames are available, it is possible to fit a local circular arc to 
the locations of corresponding edgels (Figure 12.7a) and therefore obtain semi-dense curved 
surface meshes directly from the matches (Figures 12.7c and g). Another advantage of match- 
ing such curves is that they can be used to reconstruct surface shape for untextured surfaces, 
so long as there is a visible difference between foreground and background colors. 

Over the years, a number of different techniques have been developed for reconstructing 
surface shape from profile curves (Giblin and Weiss 1987; Cipolla and Blake 1992; Vaillant 


Figure 12.7 Surface reconstruction from occluding contours (Szeliski and Weiss 1998)
© 2002 Springer: (a) circular arc fitting in the epipolar plane; (b) synthetic example of
an ellipsoid with a truncated side and elliptic surface markings; (c) partially reconstructed
surface mesh seen from an oblique and top-down view; (d) real-world image sequence of a
soda can on a turntable; (e) extracted edges; (f) partially reconstructed profile curves; (g)
partially reconstructed surface mesh. (Partial reconstructions are shown so as not to clutter
the images.)


and Faugeras 1992; Zheng 1994; Boyer and Berger 1997; Szeliski and Weiss 1998). Cipolla 
and Giblin (2000) describe many of these techniques, as well as related topics such as in- 
ferring camera motion from profile curve sequences. Below, we summarize the approach 
developed by Szeliski and Weiss (1998), which assumes a discrete set of images, rather than 
formulating the problem in a continuous differential framework. 

Let us assume that the camera is moving smoothly enough that the local epipolar geometry 
varies slowly, i.e., the epipolar planes induced by the successive camera centers and an edgel 
under consideration are nearly co-planar. The first step in the processing pipeline is to extract 
and link edges in each of the input images (Figures 12.7b and e). Next, edgels in successive 
images are matched using pairwise epipolar geometry, proximity and (optionally) appearance. 
This provides a linked set of edges in the spatio-temporal volume, which is sometimes called 
the weaving wall (Baker 1989). 


To reconstruct the 3D location of an individual edgel, along with its local in-plane normal 
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and curvature, we project the viewing rays corresponding to its neighbors onto the instanta- 
neous epipolar plane defined by the camera center, the viewing ray, and the camera velocity, 
as shown in Figure 12.7a. We then fit an osculating circle to the projected lines, from which 
we can compute a 3D point position (Szeliski and Weiss 1998). 
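
One way to carry out such a fit, sketched here in 2D within the epipolar plane, is to note that a circle with center $c$ and radius $r$ is tangent to a line with unit normal $n_i$ through a point $q_i$ when $n_i \cdot (c - q_i) = r$, which is linear in $(c, r)$. This is an illustrative least-squares formulation under the assumption that all normals are oriented toward the circle center; it is not claimed to reproduce the exact estimator of the cited paper, and all names and test values are hypothetical.

```python
import numpy as np

def fit_osculating_circle(origins, normals):
    """Least-squares circle tangent to the 2D lines {x : n_i . (x - q_i) = 0}.

    origins: (N, 2) points q_i on each line (e.g., projected camera centers).
    normals: (N, 2) line normals, all oriented toward the circle center.
    Returns the center c (2-vector) and the radius r.
    """
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    A = np.hstack([n, -np.ones((len(n), 1))])     # unknowns are (c_x, c_y, r)
    b = np.sum(n * origins, axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:2], sol[2]

# Synthetic check: three nearby rays tangent to a known circle.
c_true, r_true = np.array([1.0, 5.0]), 0.5
thetas = np.array([-1.7, -1.6, -1.5])
radial = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
tangent = np.stack([-np.sin(thetas), np.cos(thetas)], axis=1)
contacts = c_true + r_true * radial               # true tangency points
origins = contacts + 3.0 * tangent                # any point on each ray can serve as q_i
normals = -radial                                 # oriented from each line toward the center

c_est, r_est = fit_osculating_circle(origins, normals)
point = c_est - r_est * normals[1]                # estimated contact point on the middle ray
```

A radius near zero in such a fit corresponds to the rays nearly meeting at a single point, i.e., a fixed surface marking or crease rather than a smooth profile.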

The resulting set of 3D points, along with their spatial (in-image) and temporal (between- 
image) neighbors, form a 3D surface mesh with local normal and curvature estimates (Fig- 
ures 12.7c and g). Note that whenever a curve is due to a surface marking or a sharp crease 
edge, rather than a smooth surface profile curve, this shows up as a 0 or small radius of curva- 
ture. Such curves result in isolated 3D space curves, rather than elements of smooth surface 
meshes, but can still be incorporated into the 3D surface model during a later stage of surface 
interpolation (Section 13.3.1). 

More recent examples of 3D curve reconstruction from sequences of RGB and RGB- 
D images include (Li, Yao et al. 2018; Liu, Chen et al. 2018; Wang, Liu et al. 2020), the 
latest of which can even recover camera pose with untextured backgrounds. When the thin 
structures being modeled are planar manifolds, such as leaves or paper, as opposed to true 
3D curves such as wires, specially tailored mesh representations may be more appropriate 
(Kim, Zimmer et al. 2013; Yücer, Kim et al. 2016; Yücer, Sorkine-Hornung et al. 2016), as
discussed in more detail in Sections 12.7.2 and 14.3.


12.3 Dense correspondence 


While sparse matching algorithms are still occasionally used, most stereo matching algo- 
rithms today focus on dense correspondence, as this is required for applications such as 
image-based rendering or modeling. This problem is more challenging than sparse corre- 
spondence, because inferring depth values in textureless regions requires a certain amount of 
guesswork. (Think of a solid colored background seen through a picket fence. What depth 
should it be?) 

In this section, we review the taxonomy and categorization scheme for dense correspon- 
dence algorithms first proposed by Scharstein and Szeliski (2002). The taxonomy consists 
of a set of algorithmic “building blocks” from which a large set of algorithms can be constructed.
It is based on the observation that stereo algorithms generally perform some subset
of the following four steps:

1. matching cost computation;
2. cost (support) aggregation;
3. disparity computation and optimization; and
4. disparity refinement.


For example, local (window-based) algorithms (Section 12.4), where the disparity com- 
putation at a given point depends only on intensity values within a finite window, usually 
make implicit smoothness assumptions by aggregating support. Some of these algorithms 
can cleanly be broken down into steps 1, 2, 3. For example, the traditional sum-of-squared-differences
(SSD) algorithm can be described as:

1. The matching cost is the squared difference of intensity values at a given disparity.
2. Aggregation is done by summing the matching cost over square windows with constant disparity.
3. Disparities are computed by selecting the minimal (winning) aggregated value at each pixel.
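
The three SSD steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a tuned implementation: it assumes rectified grayscale images, uses the convention that pixel (x, y) in the reference image matches (x + d, y) in the other, clamps out-of-image columns to the border, and computes window sums with an integral image. The function names, the window radius, and the synthetic test pair are hypothetical.

```python
import numpy as np

def box_sum(a, radius):
    """Sum of a over (2r+1) x (2r+1) windows, with edge replication at the borders."""
    r, k = radius, 2 * radius + 1
    p = np.pad(a, ((r, r), (r, r)), mode="edge")
    S = np.cumsum(np.cumsum(p, axis=0), axis=1)   # integral image
    S = np.pad(S, ((1, 0), (1, 0)))               # zero row/column for clean differencing
    return S[k:, k:] - S[:-k, k:] - S[k:, :-k] + S[:-k, :-k]

def ssd_stereo(ref, other, d_min, d_max, radius=2):
    """Return the winner-take-all disparity map and the aggregated cost volume."""
    H, W = ref.shape
    xs = np.arange(W)
    costs = []
    for d in range(d_min, d_max + 1):
        shifted = other[:, np.clip(xs + d, 0, W - 1)]   # other image sampled at x + d
        sq = (ref - shifted) ** 2                        # step 1: matching cost
        costs.append(box_sum(sq, radius))                # step 2: aggregation
    C = np.stack(costs)
    return d_min + np.argmin(C, axis=0), C               # step 3: winner-take-all

# Synthetic pair with a constant true disparity of 4 pixels.
rng = np.random.default_rng(0)
other = rng.random((20, 40))
ref = other[:, np.clip(np.arange(40) + 4, 0, 39)]
disp, C = ssd_stereo(ref, other, d_min=0, d_max=7)
```

Away from the right image border, where the clamped shifts make several disparities indistinguishable, the recovered map equals the true disparity of 4.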


Some local algorithms, however, combine steps 1 and 2 and use a matching cost that is based
on a support region, e.g., normalized cross-correlation (Hannah 1974; Bolles, Baker, and 
Hannah 1993) and the rank transform (Zabih and Woodfill 1994) and other ordinal measures 
(Bhat and Nayar 1998). (This can also be viewed as a preprocessing step; see Section 12.3.1.) 

Global algorithms, on the other hand, make explicit smoothness assumptions and then 
solve a global optimization problem (Section 12.5). Such algorithms typically do not per- 
form an aggregation step, but rather seek a disparity assignment (step 3) that minimizes a 
global cost function that consists of data (step 1) terms and smoothness terms. The main dis- 
tinction among these algorithms is the minimization procedure used, e.g., simulated anneal- 
ing (Marroquin, Mitter, and Poggio 1987; Barnard 1989), probabilistic (mean-field) diffusion 
(Scharstein and Szeliski 1998), expectation maximization (EM) (Birchfield, Natarajan, and 
Tomasi 2007), graph cuts (Boykov, Veksler, and Zabih 2001), or loopy belief propagation 
(Sun, Zheng, and Shum 2003), to name just a few. 

In between these two broad classes are certain iterative algorithms that do not explic- 
itly specify a global function to be minimized, but whose behavior mimics closely that of 
iterative optimization algorithms (Marr and Poggio 1976; Zitnick and Kanade 2000). Hier- 
archical (coarse-to-fine) algorithms resemble such iterative algorithms, but typically operate 
on an image pyramid where results from coarser levels are used to constrain a more local 
search at finer levels (Witkin, Terzopoulos, and Kass 1987; Quam 1984; Bergen, Anandan et 
al. 1992). Also situated between local and global methods is semi-global matching (SGM)
(Hirschmüller 2008), which approximates minimizing a 2D cost function via 1D optimization
(see Section 12.5.1), as well as methods that avoid exploring the whole search space,
e.g., PatchMatch stereo (Bleyer, Rhemann, and Rother 2011) and local plane sweeps (LPS) 
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(Sinha, Scharstein, and Szeliski 2014). A large number of neural network algorithms have 
also been developed for stereo matching, which we review in Section 12.6. 

While most stereo matching algorithms produce a single disparity map with respect to a 
reference input image, or a path through the disparity space that encodes a continuous surface 
(Figure 12.13), a few algorithms compute fractional opacity values along with depths and 
colors for each pixel (Szeliski and Golland 1999; Zhou, Tucker et al. 2018; Flynn, Broxton 
et al. 2019). As these are closely related to volumetric reconstruction techniques, we discuss
them in Section 12.7.2 as well as Section 14.2.1 on image-based rendering with layers.


12.3.1 Similarity measures 


The first component of any dense stereo matching algorithm is a similarity measure that 
compares pixel values in order to determine how likely they are to be in correspondence. In 
this section, we briefly review the similarity measures introduced in Section 9.1 and mention a 
few others that have been developed specifically for stereo matching (Scharstein and Szeliski 
2002; Hirschmüller and Scharstein 2009).

The most common pixel-based matching costs include sums of squared intensity differ- 
ences (SSD) (Hannah 1974) and absolute intensity differences (SAD) (Kanade 1994). In 
the video processing community, these matching criteria are referred to as the mean-squared 
error (MSE) and mean absolute difference (MAD) measures; the term displaced frame dif- 
ference is also often used (Tekalp 1995). 

More recently, robust measures (9.2), including truncated quadratics and contaminated 
Gaussians, have been proposed (Black and Anandan 1996; Black and Rangarajan 1996; 
Scharstein and Szeliski 1998; Barron 2019). These measures are useful because they limit the 
influence of mismatches during aggregation. Vaish, Szeliski et al. (2006) compare a number 
of such robust measures, including a new one based on the entropy of the pixel values at each 
disparity hypothesis (Zitnick, Kang et al. 2004), which is particularly useful in multi-view 
stereo. 
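To illustrate how a robust measure limits the influence of a mismatch, here is a small sketch of a truncated quadratic. The function name and the threshold parameter `tau` are hypothetical; the cited papers use their own parameterizations.

```python
import numpy as np

def truncated_quadratic(diff, tau):
    """Quadratic cost for |diff| <= tau, constant tau**2 beyond that."""
    diff = np.asarray(diff, dtype=np.float64)
    return np.minimum(diff ** 2, tau ** 2)
```

With `tau = 10`, a single gross outlier with an intensity difference of 100 contributes 100 to an aggregated score instead of 10000, so it can no longer dominate the window.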

Other traditional matching costs include normalized cross-correlation (9.11) (Hannah 
1974; Bolles, Baker, and Hannah 1993; Evangelidis and Psarakis 2008), which behaves 
similarly to sum-of-squared-differences (SSD), and binary matching costs (i.e., match or no 
match) (Marr and Poggio 1976), based on binary features such as edges (Baker and Binford 
1981; Grimson 1985) or the sign of the Laplacian (Nishihara 1984). Because of their poor 
discriminability, simple binary matching costs are no longer used in dense stereo matching. 

Some costs are insensitive to differences in camera gain or bias, for example gradient- 
based measures (Seitz 1989; Scharstein 1994), phase and filter-bank responses (Marr and 
Poggio 1979; Kass 1988; Jenkin, Jepson, and Tsotsos 1991; Jones and Malik 1992), filters 
(a) Intensity image (b) Mean filter (c) LoG filter (d) BilSub filter (e) Rank filter (f) SoftRank filter 


Figure 12.8 Various similarity measures (pre-processing filters) studied in (Hirschmüller 
and Scharstein 2009) © 2009 IEEE. The contrast of (b)–(d) has been increased for better 
visualization. 


that remove regular or robust (bilaterally filtered) means (Ansar, Castano, and Matthies 2004; 
Hirschmüller and Scharstein 2009), dense feature descriptors (Tola, Lepetit, and Fua 2010), 
and non-parametric measures such as rank and census transforms (Zabih and Woodfill 1994), 
ordinal measures (Bhat and Nayar 1998), or entropy (Zitnick, Kang et al. 2004; Zitnick and 
Kang 2007). The census transform, which converts each pixel inside a moving window into 
a bit vector representing which neighbors are above or below the central pixel, was found 
by Hirschmüller and Scharstein (2009) to be quite robust against large-scale, non-stationary 
exposure and illumination changes. Figure 12.8 shows a few of the transformations that can 
be applied to images to improve their similarity across illumination variations. 
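The census transform described above can be sketched as follows. The function names, the edge-replicating border handling, and the use of the Hamming distance between bit vectors as the matching cost are assumptions for illustration rather than details given in the text.

```python
import numpy as np

def census_transform(img, radius=1):
    """Bit vector per pixel: a bit is True where a neighbor is darker than the center."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # the center pixel is not compared with itself
            neighbor = padded[radius + dy:radius + dy + h,
                              radius + dx:radius + dx + w]
            bits.append(neighbor < img)
    return np.stack(bits, axis=-1)  # shape (h, w, (2 * radius + 1) ** 2 - 1)

def hamming_cost(census_a, census_b):
    """Number of differing census bits at each pixel."""
    return np.count_nonzero(census_a != census_b, axis=-1)
```

Because only the ordering of intensities matters, any monotonically increasing change of gain or bias leaves the bit vectors unchanged, which is consistent with the robustness to exposure changes reported above.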

It is also possible to correct for differing global camera characteristics by performing 
a preprocessing or iterative refinement step that estimates inter-image bias–gain variations 
using global regression (Gennert 1988), histogram equalization (Cox, Roy, and Hingorani 
1995), or mutual information (Kim, Kolmogorov, and Zabih 2003; Hirschmüller 2008). Local, 
smoothly varying compensation fields have also been proposed (Strecha, Tuytelaars, and 
Van Gool 2003; Zhang, McMillan, and Yu 2006). 

To compensate for sampling issues, i.e., dramatically different pixel values in high-frequency 
areas, Birchfield and Tomasi (1998) proposed a matching cost that is less sensitive to shifts in 
image sampling. Rather than just comparing pixel values shifted by integral amounts (which 
may miss a valid match), they compare each pixel in the reference image against a linearly in- 
terpolated function of the other image. More detailed studies of these and additional matching 
costs are explored in Szeliski and Scharstein (2004) and Hirschmüller and Scharstein (2009). 
In particular, if you expect there to be significant exposure or appearance variation between 
images that you are matching, some of the more robust measures that performed well in the 
evaluation by Hirschmüller and Scharstein (2009), such as the census transform (Zabih and 
Woodfill 1994), ordinal measures (Bhat and Nayar 1998), bilateral subtraction (Ansar, Castano, 
and Matthies 2004), or hierarchical mutual information (Hirschmüller 2008), should 
be used. Interestingly, color information does not appear to help when utilized in matching 
costs (Bleyer and Chambon 2010), although it is important for aggregation (discussed in the next 
section). When matching more than two images, more sophisticated variants of similarity 
(photoconsistency) measures can be used, as discussed in Section 12.7 and (Furukawa and 
Hernández 2015, Chapter 2). 
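The sampling-insensitive cost of Birchfield and Tomasi can be sketched for a single pair of scanlines as shown below. The sketch uses the commonly stated form of the measure: each sample is compared against the range of the linearly interpolated other signal within half a pixel, and the smaller of the two directions is kept. Function names and border handling are assumptions.

```python
import numpy as np

def _half_pixel_range(row):
    """Min and max of the linearly interpolated row over [x - 1/2, x + 1/2]."""
    padded = np.pad(row, 1, mode="edge")
    minus = 0.5 * (row + padded[:-2])   # value halfway to the left neighbor
    plus = 0.5 * (row + padded[2:])     # value halfway to the right neighbor
    stacked = np.stack([minus, plus, row])
    return stacked.min(axis=0), stacked.max(axis=0)

def bt_dissimilarity(left_row, right_row, d):
    """Cost between left_row[x] and right_row[x - d] for all x >= d."""
    left_row = np.asarray(left_row, dtype=np.float64)
    right_row = np.asarray(right_row, dtype=np.float64)
    n = left_row.size
    l_min, l_max = _half_pixel_range(left_row)
    r_min, r_max = _half_pixel_range(right_row)
    l, r = left_row[d:], right_row[:n - d]
    # Distance from each left sample to the interpolated right interval, and vice versa.
    cost_lr = np.maximum(0.0, np.maximum(l - r_max[:n - d], r_min[:n - d] - l))
    cost_rl = np.maximum(0.0, np.maximum(r - l_max[d:], l_min[d:] - r))
    return np.minimum(cost_lr, cost_rl)
```

For a linear ramp sampled half a pixel apart in the two images, plain absolute differences are nonzero at every integer disparity, whereas this cost drops to zero.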

More recently, one of the first successes of deep learning for stereo was the learning 
of matching costs. Zbontar and LeCun (2016) trained a neural network to compare image 
patches, using data extracted from the Middlebury (Scharstein, Hirschmüller et al. 2014) 
and KITTI (Geiger, Lenz, and Urtasun 2012) datasets. This matching cost is still widely used 
in top-performing methods on these two benchmarks. 


12.4 Local methods 


Local and window-based methods aggregate the matching cost by summing or averaging 
over a support region in the DSI C(x, y, d).* A support region can be either two-dimensional 
at a fixed disparity (favoring fronto-parallel surfaces), or three-dimensional in x-y-d space 
(supporting slanted surfaces). Two-dimensional evidence aggregation has been implemented 
using square windows or Gaussian convolution (traditional), multiple windows anchored at 
different points, i.e., shiftable windows (Arnold 1983; Fusiello, Roberto, and Trucco 1997; 
Bobick and Intille 1999), windows with adaptive sizes (Okutomi and Kanade 1992; Kanade 
and Okutomi 1994; Kang, Szeliski, and Chai 2001; Veksler 2001, 2003), windows based on 
connected components of constant disparity (Boykov, Veksler, and Zabih 1998), the results of 
color-based segmentation (Yoon and Kweon 2006; Tombari, Mattoccia et al. 2008), or with 
a guided filter (Hosni, Rhemann et al. 2013). Three-dimensional support functions that have 
been proposed include limited disparity difference (Grimson 1985), limited disparity gradi- 
ent (Pollard, Mayhew, and Frisby 1985), Prazdny’s coherence principle (Prazdny 1985), and 
the work by Zitnick and Kanade (2000), which includes visibility and occlusion reasoning. 
PatchMatch stereo (Bleyer, Rhemann, and Rother 2011), discussed in more detail below, also 
does aggregation in 3D via slanted support windows. 


Aggregation with a fixed support region can be performed using 2D or 3D convolution, 
$$C(x, y, d) = w(x, y, d) * C_0(x, y, d), \qquad (12.5)$$


or, in the case of rectangular windows, using efficient moving average box-filters (Sec- 
tion 3.2.2) (Kanade, Yoshida et al. 1996; Kimura, Shinbo et al. 1999). Shiftable windows can 
also be implemented efficiently using a separable sliding min-filter (Figure 12.9) (Scharstein 


*For two surveys and comparisons of such techniques, please see the work of Gong, Yang et al. (2007) and 
Tombari, Mattoccia et al. (2008). 
Figure 12.9 Shiftable window (Scharstein and Szeliski 2002) © 2002 Springer. The effect 
of trying all 3 × 3 shifted windows around the black pixel is the same as taking the minimum 
matching score across all centered (non-shifted) windows in the same neighborhood. (For 
clarity, only three of the neighboring shifted windows are shown here.) 


Figure 12.10 Aggregation window sizes and weights adapted to image content (Tombari, 
Mattoccia et al. 2008) © 2008 IEEE: (a) original image with selected evaluation points; 
(b) variable windows (Veksler 2003); (c) adaptive weights (Yoon and Kweon 2006); (d) 
segmentation-based (Tombari, Mattoccia, and Di Stefano 2007). Notice how the adaptive 
weights and segmentation-based techniques adapt their support to similarly colored pixels. 


and Szeliski 2002, Section 4.2). Selecting among windows of different shapes and sizes can 
be performed more efficiently by first computing a summed area table (Section 3.2.3, (3.30–3.32)) 
(Veksler 2003). Selecting the right window is important, because windows must be 
large enough to contain sufficient texture and yet small enough so that they do not straddle 
depth discontinuities (Figure 12.10). An alternative method for aggregation is iterative diffusion, 
i.e., repeatedly adding to each pixel’s cost the weighted values of its neighboring pixels’ 
costs (Szeliski and Hinton 1985; Shah 1993; Scharstein and Szeliski 1998). 
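The fixed-window and shiftable-window aggregation schemes discussed in this section can be sketched with a summed area table and a separable sliding minimum. This is an illustrative sketch only: the function names, the `cost[d, y, x]` layout, and the edge-replicating borders are assumptions, and the costs are assumed to be finite.

```python
import numpy as np

def box_aggregate(cost, radius):
    """Sum each disparity slice of cost[d, y, x] over a (2r+1) x (2r+1) window."""
    k = 2 * radius + 1
    padded = np.pad(cost, ((0, 0), (radius, radius), (radius, radius)), mode="edge")
    # Summed area table with a leading row and column of zeros.
    sat = np.zeros((padded.shape[0], padded.shape[1] + 1, padded.shape[2] + 1))
    sat[:, 1:, 1:] = padded.cumsum(axis=1).cumsum(axis=2)
    return sat[:, k:, k:] - sat[:, :-k, k:] - sat[:, k:, :-k] + sat[:, :-k, :-k]

def sliding_min(values, radius, axis):
    """Minimum over a 1D window of width 2r+1 along the given axis."""
    pad = [(0, 0)] * values.ndim
    pad[axis] = (radius, radius)
    padded = np.pad(values, pad, mode="edge")
    n = values.shape[axis]
    out = np.take(padded, np.arange(n), axis=axis)
    for shift in range(1, 2 * radius + 1):
        out = np.minimum(out, np.take(padded, np.arange(shift, shift + n), axis=axis))
    return out

def shiftable_aggregate(cost, radius):
    """Best box-filtered cost over all windows that contain each pixel."""
    agg = box_aggregate(cost, radius)
    return sliding_min(sliding_min(agg, radius, axis=1), radius, axis=2)
```

The summed area table makes the cost of each box sum independent of the window size, and the two 1D minimum passes are equivalent to a minimum over the full square neighborhood.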

Of the local aggregation methods compared by Gong, Yang et al. (2007) and Tombari, 
Mattoccia et al. (2008), the fast variable window approach of Veksler (2003) and the local 
weighting approach developed by Yoon and Kweon (2006) consistently stood out as having 
the best tradeoff between performance and speed. The local weighting technique, in particular, 
is interesting because, instead of using square windows with uniform weighting, each 
pixel within an aggregation window influences the final matching cost based on its color similarity 
and spatial distance, just as in bilateral filtering (Figure 12.10c). (In fact, their aggregation 
step is closely related to doing a joint bilateral filter on the color/disparity image, except 
that it is done symmetrically in both reference and target images.) The segmentation-based 
aggregation method of Tombari, Mattoccia, and Di Stefano (2007) did even better, although 
a fast implementation of this algorithm does not yet exist. Another approach to aggregation 
is to aggregate along one or more minimum spanning trees based on pixel similarities (Yang 
2015; Li, Yu et al. 2017). 

Extensive results from Tombari, Mattoccia et al. (2008) can be found at http://www.vision.deis.unibo.it/spe. 

In local methods, the emphasis is on the matching cost computation and cost aggregation 
steps. Computing the final disparities is trivial: simply choose at each pixel the disparity 
associated with the minimum cost value. Thus, these methods perform a local “winner- 
take-all” (WTA) optimization at each pixel. A limitation of this approach (and many other 
correspondence algorithms) is that uniqueness of matches is only enforced for one image 
(the reference image), while points in the other image might match multiple points, unless 
cross-checking and subsequent hole filling is used (Fua 1993; Hirschmüller and Scharstein 
2009). 
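Winner-take-all selection and cross-checking can be sketched as follows. The function names, the tolerance parameter `tol`, and the convention that left pixel x with disparity d matches right pixel x - d are assumptions made for this example.

```python
import numpy as np

def winner_take_all(cost):
    """Disparity with the minimum cost at every pixel of cost[d, y, x]."""
    return np.argmin(cost, axis=0)

def cross_check(disp_left, disp_right, tol=1):
    """True where left-to-right and right-to-left disparities agree within tol."""
    disp_left = np.asarray(disp_left)
    disp_right = np.asarray(disp_right)
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    x_right = xs - disp_left                      # matching column in the right image
    inside = (x_right >= 0) & (x_right < w)
    x_clamped = np.clip(x_right, 0, w - 1)
    d_back = np.take_along_axis(disp_right, x_clamped, axis=1)
    return inside & (np.abs(disp_left - d_back) <= tol)
```

Pixels that fail the check are candidates for occlusions or mismatches and can be passed to a subsequent hole-filling step.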


12.4.1 Sub-pixel estimation and uncertainty 


Most stereo correspondence algorithms compute a set of disparity estimates in some dis- 
cretized space, e.g., for integer disparities (exceptions include continuous optimization tech- 
niques such as optical flow (Bergen, Anandan et al. 1992) or splines (Szeliski and Coughlan 
1997)). For applications such as robot navigation or people tracking, these may be perfectly 
adequate. However, for image-based rendering, such quantized maps lead to very unappealing 
view synthesis results, i.e., the scene appears to be made up of many thin shearing layers. 
To remedy this situation, many algorithms apply a sub-pixel refinement stage after the initial 
discrete correspondence stage. (An alternative is to simply start with more discrete disparity 
levels (Szeliski and Scharstein 2004).) 

Sub-pixel disparity estimates can be computed in a variety of ways, including iterative 
gradient descent and fitting a curve to the matching costs at discrete disparity levels (Ryan, 
Gray, and Hunt 1980; Lucas and Kanade 1981; Tian and Huhns 1986; Matthies, Kanade, 
and Szeliski 1989; Kanade and Okutomi 1994). This provides an easy way to increase the 
resolution of a stereo algorithm with little additional computation. However, to work well, 
the intensities being matched must vary smoothly, and the regions over which these estimates 
are computed must be on the same (correct) surface. 
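One common form of curve fitting is a parabola through the cost at the winning disparity and its two neighbors. The sketch below assumes a cost volume laid out as `cost[d, y, x]` and an integer disparity map; the function name and the handling of border disparities (left unrefined) are assumptions.

```python
import numpy as np

def subpixel_refine(cost, disp):
    """Refine integer disparities by fitting a parabola to three neighboring costs."""
    n_disp = cost.shape[0]
    d = np.clip(disp, 1, n_disp - 2)              # need a neighbor on each side
    c_minus = np.take_along_axis(cost, (d - 1)[None], axis=0)[0]
    c_zero = np.take_along_axis(cost, d[None], axis=0)[0]
    c_plus = np.take_along_axis(cost, (d + 1)[None], axis=0)[0]
    curvature = c_minus - 2.0 * c_zero + c_plus   # second difference of the cost
    offset = np.zeros(curvature.shape, dtype=np.float64)
    # Only refine interior disparities that sit in a convex part of the cost curve.
    valid = (curvature > 0) & (d == disp)
    offset[valid] = 0.5 * (c_minus - c_plus)[valid] / curvature[valid]
    return disp + offset, curvature
```

The returned second difference is also a simple measure of how sharply peaked the minimum is.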


Some questions have been raised about the advisability of fitting correlation curves to 
integer-sampled matching costs (Shimizu and Okutomi 2001). This situation may even be 
worse when sampling-insensitive dissimilarity measures are used (Birchfield and Tomasi 
1998). These issues are explored in more depth by Szeliski and Scharstein (2004) and Haller 
and Nedevschi (2012). 

Besides sub-pixel computations, there are other ways of post-processing the computed 
disparities. Occluded areas can be detected using cross-checking, i.e., comparing left-to- 
right and right-to-left disparity maps (Fua 1993). A median filter can be applied to clean 
up spurious mismatches, and holes due to occlusion can be filled by surface fitting or by 
distributing neighboring disparity estimates (Birchfield and Tomasi 1999; Scharstein 1999; 
Hirschmiiller and Scharstein 2009). 

Another kind of post-processing, which can be useful in later processing stages, is to asso- 
ciate confidences with per-pixel depth estimates (Figure 12.11), which can be done by looking 
at the curvature of the correlation surface, i.e., how strong the minimum in the DSI image is 
at the winning disparity. Matthies, Kanade, and Szeliski (1989) show that under the assump- 
tion of small noise, photometrically calibrated images, and densely sampled disparities, the 
variance of a local depth estimate can be estimated as 

$$\mathrm{Var}(d) = \frac{\sigma_I^2}{a}, \qquad (12.6)$$
where $a$ is the curvature of the DSI as a function of $d$, which can be measured using a local 
parabolic fit or by squaring all the horizontal gradients in the window, and $\sigma_I^2$ is the variance 
of the image noise, which can be estimated from the minimum SSD score. (See also 
Section 8.1.4, (9.37), and Appendix B.6.) Over the years, a variety of stereo confidence measures 
have been proposed. Hu and Mordohai (2012) and Poggi, Kim et al. (2021) provide 
thorough surveys of this topic. 
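As a small illustration of Equation (12.6), the function below turns a per-pixel curvature estimate and a noise variance into a disparity variance. The function name is hypothetical, both inputs are assumed to be supplied by the caller, and pixels with non-positive curvature (no well-defined minimum) are given infinite variance by assumption.

```python
import numpy as np

def disparity_variance(curvature, noise_var):
    """Var(d) = noise_var / curvature, with infinite variance where curvature <= 0."""
    curvature = np.asarray(curvature, dtype=np.float64)
    var = np.full(curvature.shape, np.inf)
    np.divide(noise_var, curvature, out=var, where=curvature > 0)
    return var
```

Textured regions produce sharply curved cost minima and therefore small variances, which matches the behavior shown in Figure 12.11.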


12.4.2 Application: Stereo-based head tracking 


A common application of real-time stereo algorithms is for tracking the position of a user 
interacting with a computer or game system. The use of stereo can dramatically improve 
the reliability of such a system compared to trying to use monocular color and intensity 
information (Darrell, Gordon et al. 2000). Once recovered, this information can be used in 
a variety of applications, including controlling a virtual environment or game, correcting the 
apparent gaze during video conferencing, and background replacement. We discuss the first 
two applications below and defer the discussion of background replacement to Section 12.5.3. 


The use of head tracking to control a user’s virtual viewpoint while viewing a 3D object 
or environment on a computer monitor is sometimes called fish tank virtual reality, as the 


Figure 12.11 Uncertainty in stereo depth estimation (Szeliski 1991b): (a) input image; (b) 
estimated depth map (blue is closer); (c) estimated confidence (red is higher). As you can see, 
more textured areas have higher confidence. 


user is observing a 3D world as if it were contained inside a fish tank (Ware, Arthur, and 
Booth 1993). Early versions of these systems used mechanical head tracking devices and 
stereo glasses. Today, such systems can be controlled using stereo-based head tracking and 
stereo glasses can be replaced with autostereoscopic displays. Head tracking can also be used 
to construct a “virtual mirror”, where the user’s head can be modified in real time using a 
variety of visual effects (Darrell, Baker et al. 1997). 


Another application of stereo head tracking and 3D reconstruction is in gaze correction 
(Ott, Lewis, and Cox 1993). When a user participates in a desktop video conference or video 
chat, the camera is usually placed on top of the monitor. Because the person is gazing at a 
window somewhere on the screen, it appears as if they are looking down and away from the 
other participants, instead of straight at them. Replacing the single camera with two or more 
cameras enables a virtual view to be constructed right at the position where they are looking, 
resulting in virtual eye contact. Real-time stereo matching is used to construct an accurate 3D 
head model and view interpolation (Section 14.1) is used to synthesize the novel in-between 
view (Criminisi, Shotton et al. 2003). More recent publications on gaze correction in video 
conferencing include Kuster, Popa et al. (2012) and Kononenko and Lempitsky (2015), and 
the technology has been deployed in several commercial video conferencing systems.* 


*https://venturebeat.com/2019/10/03/microsofts-ai-powered-eye-gaze-tech-is-exclusive-to-the-surface-pro-x 
12.5 Global optimization 


Global stereo matching methods perform some optimization or iteration steps after the dis- 
parity computation phase and often skip the aggregation step altogether, because the global 
smoothness constraints perform a similar function. Many global methods are formulated in 
an energy-minimization framework, where, as we saw in Chapters 4 (4.24–4.27) and 9, the 
objective is to find a solution d that minimizes a global energy, 
$$E(d) = E_D(d) + \lambda E_S(d). \qquad (12.7)$$


The data term, $E_D(d)$, measures how well the disparity function d agrees with the input image 
pair. Using our previously defined disparity space image, we define this energy as 
$$E_D(d) = \sum_{(x,y)} C(x, y, d(x, y)), \qquad (12.8)$$
where C is the (initial or aggregated) matching cost DSI. 
The smoothness term $E_S(d)$ encodes the smoothness assumptions made by the algorithm. 
To make the optimization computationally tractable, the smoothness term is often restricted 
to measuring only the differences between neighboring pixels’ disparities, 
$$E_S(d) = \sum_{(x,y)} \rho(d(x, y) - d(x+1, y)) + \rho(d(x, y) - d(x, y+1)), \qquad (12.9)$$
where $\rho$ is some monotonically increasing function of disparity difference. It is also possible 
to use larger neighborhoods, such as $\mathcal{N}_8$, which can lead to better boundaries (Boykov and 
Kolmogorov 2003), or to use second-order smoothness terms (Woodford, Reid et al. 2008), 
but such terms require more complex optimization techniques. An alternative to smooth- 
ness functionals is to use a lower-dimensional representation, such as splines (Szeliski and 
Coughlan 1997). 

In standard regularization (Section 4.2), $\rho$ is a quadratic function, which makes d smooth 
everywhere and may lead to poor results at object boundaries. Energy functions that do not 
have this problem are called discontinuity-preserving and are based on robust $\rho$ functions 
(Terzopoulos 1986b; Black and Rangarajan 1996). The seminal paper by Geman and Ge- 
man (1984) gave a Bayesian interpretation of these kinds of energy functions and proposed a 
discontinuity-preserving energy function based on Markov random fields (MRFs) and addi- 
tional line processes, which are additional binary variables that control whether smoothness 
penalties are enforced or not. Black and Rangarajan (1996) show how independent line process 
variables can be replaced by robust pairwise disparity terms. 
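To make the energy in (12.7–12.9) concrete, the sketch below evaluates it for a given integer disparity map, with the penalty function passed in as a parameter. The function names, the `cost[d, y, x]` layout, and the particular quadratic and truncated-linear penalties used in the usage note are assumptions chosen for illustration.

```python
import numpy as np

def data_energy(cost, disp):
    """E_D: sum of C(x, y, d(x, y)) over all pixels, with cost indexed as cost[d, y, x]."""
    return np.take_along_axis(cost, disp[None], axis=0).sum()

def smoothness_energy(disp, rho):
    """E_S: rho applied to horizontal and vertical neighbor disparity differences."""
    disp = disp.astype(np.float64)
    horizontal = rho(disp[:, :-1] - disp[:, 1:]).sum()
    vertical = rho(disp[:-1, :] - disp[1:, :]).sum()
    return horizontal + vertical

def global_energy(cost, disp, lam, rho):
    """E(d) = E_D(d) + lam * E_S(d)."""
    return data_energy(cost, disp) + lam * smoothness_energy(disp, rho)
```

For a disparity map with a single step of height 5 between two columns, a quadratic penalty such as `lambda t: t ** 2` charges 25 per row at the step, while a truncated linear penalty such as `lambda t: np.minimum(np.abs(t), 2.0)` charges only 2, which is why robust penalties are better at preserving discontinuities.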
The terms in $E_S$ can also be made to depend on the intensity differences, e.g., 
$$\rho_d(d(x, y) - d(x+1, y)) \cdot \rho_I(\|I(x, y) - I(x+1, y)\|), \qquad (12.10)$$
where $\rho_I$ is some monotonically decreasing function of intensity differences that lowers 
smoothness costs at high-intensity gradients. This idea (Gamble and Poggio 1987; Fua 1993; 
Bobick and Intille 1999; Boykov, Veksler, and Zabih 2001) encourages disparity discontinu- 
ities to coincide with intensity or color edges and appears to account for some of the good 
performance of global optimization approaches. While most researchers set these functions 
heuristically, Pal, Weinman et al. (2012) show how the free parameters in such conditional 
random fields (Section 4.3, (4.47)) can be learned from ground truth disparity maps. 
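A minimal sketch of an intensity-dependent smoothness term in the spirit of (12.10), restricted to horizontal neighbors for brevity. The exponential form of the intensity weight and the parameter `sigma` are assumptions; the text only requires a monotonically decreasing function of the intensity difference.

```python
import numpy as np

def edge_aware_smoothness(disp, image, rho_d, sigma=10.0):
    """Horizontal smoothness energy, down-weighted across strong intensity edges."""
    disp = np.asarray(disp, dtype=np.float64)
    image = np.asarray(image, dtype=np.float64)
    # Assumed rho_I: decays exponentially with the absolute intensity difference.
    weight = np.exp(-np.abs(image[:, :-1] - image[:, 1:]) / sigma)
    return (rho_d(disp[:, :-1] - disp[:, 1:]) * weight).sum()
```

A disparity step that coincides with an intensity edge then costs far less than the same step inside a uniformly colored region.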

Once the global energy has been defined, a variety of algorithms can be used to find a 
(local) minimum. Traditional approaches associated with regularization and Markov random 
fields include continuation (Blake and Zisserman 1987), simulated annealing (Geman and 
Geman 1984; Marroquin, Mitter, and Poggio 1987; Barnard 1989), highest confidence first 
(Chou and Brown 1990), and mean-field annealing (Geiger and Girosi 1991). 

Max-flow and graph cut methods have been proposed to solve a special class of global op- 
timization problems (Roy and Cox 1998; Boykov, Veksler, and Zabih 2001; Ishikawa 2003). 
Such methods are more efficient than simulated annealing and have produced good results, 
as have techniques based on loopy belief propagation (Sun, Zheng, and Shum 2003; Tappen 
and Freeman 2003). Appendix B.5 and survey papers on MRF inference (Szeliski, Zabih et 
al. 2008; Blake, Kohli, and Rother 2011; Kappes, Andres et al. 2015) discuss and compare 
such techniques in more detail. 

While global optimization techniques have largely been displaced by deep learning ap- 
proaches (Section 12.6) for datasets such as KITTI with large amounts of training images and 
high overlap with the test distributions, they still perform the best on challenging stereo pairs 
with fine details such as the high-resolution Middlebury pairs (Scharstein, Hirschmüller et al. 
2014). One example of such an approach is the local expansion moves algorithm developed 
by Taniai, Matsushita et al. (2018). Below, we describe some related techniques that are of 
historical interest, run faster, or are tailored to handle specific situations. 


Cooperative algorithms. Cooperative algorithms, inspired by computational models of hu- 
man stereo vision, were among the earliest methods proposed for disparity computation (Dev 
1974; Marr and Poggio 1976; Marroquin 1983; Szeliski and Hinton 1985; Zitnick and Kanade 
2000). Such algorithms iteratively update disparity estimates using non-linear operations 
based on neighboring disparity and matching values and result in an overall behavior similar 
to global optimization algorithms. In fact, for some of these algorithms, it is possible to ex- 
plicitly state a global function that is being minimized (Scharstein and Szeliski 1998). There 
Figure 12.12 Stereo matching using local plane sweeps (Sinha, Scharstein, and Szeliski 
2014) © 2014 IEEE: (a) input image; (b) initial sparse matches; (c) matches grouped by 
slanted planes; (d) 3D visualization of planes and grouped features. 


are also iterative algorithms that look at a larger neighborhood in the image, such as Patch- 
Match Stereo (Bleyer, Rhemann, and Rother 2011), which estimates a local 3D plane at each 
pixel and uses the non-local PatchMatch algorithm (Barnes, Shechtman et al. 2009) to quickly 
find approximate nearest neighbors in plane space. This approach has recently been applied 
to the multi-view stereo setting to produce a high-quality algorithm that is extremely efficient 
in both time and space (Wang, Galliani et al. 2021). 


Coarse-to-fine and incremental warping. Most of today’s best algorithms first enumer- 
ate all possible matches at all possible disparities and then select the best set of matches in 
some way. Faster approaches can sometimes be obtained using methods inspired by classic 
(infinitesimal) optical flow computation. Here, images are successively warped and dispar- 
ity estimates incrementally updated until a satisfactory registration is achieved. These tech- 
niques are most often implemented within a coarse-to-fine hierarchical refinement framework 
(Quam 1984; Bergen, Anandan et al. 1992; Barron, Fleet, and Beauchemin 1994; Szeliski and 
Coughlan 1997). Recently, coarse-to-fine or pyramid approaches have been having a renais- 
sance in modern deep networks, applied both to optical flow (Ranjan and Black 2017; Sun, 
Yang et al. 2018) and stereo (Chang and Chen 2018). 


Local plane sweeps. Instead of sweeping planes perpendicular to the viewing direction, it is 
also possible to model the scene using a collection of slanted planes, which is beneficial if the 
scene contains highly slanted planar surfaces such as floors or walls, as shown in Figure 12.12 
(Sinha, Scharstein, and Szeliski 2014). Once such planes have been estimated and pixels 
assigned to each plane, it is then possible to estimate per-pixel out-of-plane displacements 
to better model curved surfaces. Slanted planes were also used earlier in the PatchMatch 
stereo algorithm (Bleyer, Rhemann, and Rother 2011), and have also been used more recently 
in the planar bilateral solver used for smartphone AR (Valentin, Kowdle et al. 2018). 
Figure 12.13 Stereo matching using dynamic programming, as illustrated by (a) Scharstein 
and Szeliski (2002) © 2002 Springer and (b) Kolmogorov, Criminisi et al. (2006) © 2006 
IEEE. For each pair of corresponding scanlines, a minimizing path through the matrix of 
all pairwise matching costs (DSI) is selected. Lowercase letters (a–k) symbolize the intensities 
along each scanline. Uppercase letters represent the selected path through the matrix. 
Matches are indicated by M, while partially occluded points (which have a fixed cost) are 
indicated by L or R, corresponding to points only visible in the left or right images, respectively. 
Usually, only a limited disparity range is considered (0–4 in the figure, indicated by 
the non-shaded squares). The representation in (a) allows for diagonal moves while the rep- 
resentation in (b) does not. Note that these diagrams, which use the Cyclopean representation 
of depth, i.e., depth relative to a camera between the two input cameras, show an “unskewed” 
x-d slice through the DSI. 


12.5.1 Dynamic programming 


A different class of global optimization algorithm is based on dynamic programming. While 
the 2D optimization of Equation (12.7) can be shown to be NP-hard for common classes 
of smoothness functions (Veksler 1999), dynamic programming can find the global mini- 
mum for independent scanlines in polynomial time. Dynamic programming was first used 
for stereo vision in sparse, edge-based methods (Baker and Binford 1981; Ohta and Kanade 
1985). More recent approaches have focused on the dense (intensity-based) scanline matching 
problem (Belhumeur 1996; Geiger, Ladendorf, and Yuille 1992; Cox, Hingorani et al. 
1996; Bobick and Intille 1999; Birchfield and Tomasi 1999). These approaches work by 
computing the minimum-cost path through the matrix of all pairwise matching costs between 
two corresponding scanlines, i.e., through a horizontal slice of the DSI. Partial occlusion is 
handled explicitly by assigning a group of pixels in one image to a single pixel in the other 
image. Figure 12.13 schematically shows how DP works, while Figure 12.5f shows a real 
DSI slice over which the DP is applied. 

To implement dynamic programming for a scanline y, each entry (state) in a 2D cost 
matrix D(m, n) is computed by combining its DSI matching cost value with one of its prede- 
cessor cost values while also including a fixed penalty for occluded pixels. The aggregation 
rules corresponding to Figure 12.13b are given by Kolmogorov, Criminisi et al. (2006), who 
also use a two-state foreground–background model for bi-layer segmentation. 

Problems with dynamic programming stereo include the selection of the right cost for oc- 
cluded pixels and the difficulty of enforcing inter-scanline consistency, although several meth- 
ods propose ways of addressing the latter (Ohta and Kanade 1985; Belhumeur 1996; Cox, 
Hingorani et al. 1996; Bobick and Intille 1999; Birchfield and Tomasi 1999; Kolmogorov, 
Criminisi et al. 2006). Another problem is that the dynamic programming approach requires 
enforcing the monotonicity or ordering constraint (Yuille and Poggio 1984). This constraint 
requires that the relative ordering of pixels on a scanline remain the same between the two 
views, which may not be the case in scenes containing narrow foreground objects. 

An alternative to traditional dynamic programming, introduced by Scharstein and Szeliski 
(2002), is to neglect the vertical smoothness constraints in (12.9) and simply optimize inde- 
pendent scanlines in the global energy function (12.7). The advantage of this scanline op- 
timization algorithm is that it computes the same representation and minimizes a reduced 
version of the same energy function as the full 2D energy function (12.7). Unfortunately, it 
still suffers from the same streaking artifacts as dynamic programming. Dynamic program- 
ming is also possible on tree structures, which can ameliorate the streaking (Veksler 2005). 
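Scanline optimization, as described above, can be sketched as a standard dynamic program over a single x-d slice of the DSI. This sketch has no explicit occlusion states or ordering constraint; the function name, the `cost_row[d, x]` layout, and the penalty function `rho` are assumptions for illustration.

```python
import numpy as np

def scanline_optimize(cost_row, lam, rho):
    """Minimize sum_x C[d_x, x] + lam * sum_x rho(d_x - d_{x+1}) on one scanline."""
    n_disp, width = cost_row.shape
    levels = np.arange(n_disp)
    # pairwise[i, j]: penalty for moving from disparity j at x - 1 to disparity i at x.
    pairwise = lam * rho(levels[:, None] - levels[None, :])
    total = cost_row[:, 0].astype(np.float64)
    back = np.zeros((n_disp, width), dtype=np.int64)
    for x in range(1, width):
        candidates = total[None, :] + pairwise     # indexed [current d, previous d]
        back[:, x] = np.argmin(candidates, axis=1)
        total = cost_row[:, x] + candidates.min(axis=1)
    # Trace the minimum-cost path back from the last pixel.
    disp = np.empty(width, dtype=np.int64)
    disp[-1] = np.argmin(total)
    for x in range(width - 1, 0, -1):
        disp[x - 1] = back[disp[x], x]
    return disp, total.min()
```

Because each row is solved independently, nothing couples the solutions of adjacent scanlines, which is the source of the streaking artifacts mentioned above.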

Much higher quality results can be obtained by summing up the cumulative cost function 
from multiple directions, e.g., from the eight cardinal directions, N, E, W, S, NE, SE, SW, 
NW (Hirschmüller 2008). The resulting semi-global matching (SGM) algorithm performs 
quite well and is extremely efficient, enabling real-time low-power implementations (Gehrig, 
Eberli, and Meyer 2009). Drory, Haubold et al. (2014) show that SGM is equivalent to early 
stopping for a particular variant of belief propagation. Semi-global matching has also been 
extended using learned components, e.g., SGM-Net (Seki and Pollefeys 2017), which uses a 
CNN to adjust transition costs, and SGM-Forest (Schönberger, Sinha, and Pollefeys 2018), 
which uses a random-forest classifier to fuse disparity proposals from different directions. 
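The idea of summing path costs from several directions can be sketched as follows. The recurrence is the path-cost form commonly given for SGM, with a small penalty `p1` for one-level disparity changes and a larger penalty `p2` for bigger jumps; both penalties, their default values, and the function names are assumptions, since the text does not state them. For brevity, only four of the eight directions are used.

```python
import numpy as np

def aggregate_path(cost, p1, p2):
    """Accumulate cost[d, y, x] along the left-to-right direction of every row."""
    n_disp, h, w = cost.shape
    path = np.empty_like(cost, dtype=np.float64)
    path[:, :, 0] = cost[:, :, 0]
    for x in range(1, w):
        prev = path[:, :, x - 1]                   # shape (n_disp, h)
        prev_min = prev.min(axis=0)                # best previous cost in each row
        # Costs of arriving from disparity d - 1 or d + 1 (infinite at the range ends).
        from_below = np.vstack([np.full((1, h), np.inf), prev[:-1]]) + p1
        from_above = np.vstack([prev[1:], np.full((1, h), np.inf)]) + p1
        best = np.minimum(np.minimum(prev, prev_min + p2),
                          np.minimum(from_below, from_above))
        # Subtracting prev_min keeps the accumulated values bounded.
        path[:, :, x] = cost[:, :, x] + best - prev_min
    return path

def semi_global_costs(cost, p1=1.0, p2=4.0):
    """Sum path costs from four directions (left, right, up, down)."""
    total = aggregate_path(cost, p1, p2)                              # left to right
    total += aggregate_path(cost[:, :, ::-1], p1, p2)[:, :, ::-1]     # right to left
    by_cols = cost.transpose(0, 2, 1)                                 # swap y and x
    total += aggregate_path(by_cols, p1, p2).transpose(0, 2, 1)       # top to bottom
    total += aggregate_path(by_cols[:, :, ::-1], p1, p2)[:, :, ::-1].transpose(0, 2, 1)
    return total
```

The final disparity map is obtained by taking the per-pixel minimum of the summed costs, e.g., `np.argmin(semi_global_costs(cost), axis=0)`.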


12.5.2 Segmentation-based techniques 


While most stereo matching algorithms perform their computations on a per-pixel basis, some 
techniques first segment the images into regions and then try to label each region with a 
disparity. 
Figure 12.14 Segmentation-based stereo matching (Zitnick, Kang et al. 2004) © 2004 
ACM: (a) input color image; (b) color-based segmentation; (c) initial disparity estimates; 
(d) final piecewise-smoothed disparities; (e) MRF neighborhood defined over the segments 
in the disparity space distribution (Zitnick and Kang 2007) © 2007 Springer. 




Figure 12.15 Stereo matching with adaptive over-segmentation and matting (Taguchi, 
Wilburn, and Zitnick 2008) © 2008 IEEE: (a) segment boundaries are refined during the optimization, 
leading to more accurate results (e.g., the thin green leaf in the bottom row); (b) 
alpha mattes are extracted at segment boundaries, which leads to visually better compositing 
results (middle column). 


For example, Tao, Sawhney, and Kumar (2001) segment the reference image, estimate 
per-pixel disparities using a local technique, and then do local plane fits inside each segment 
before applying smoothness constraints between neighboring segments. Zitnick, Kang et al. 
(2004) and Zitnick and Kang (2007) use over-segmentation to mitigate initial bad segmen- 
tations. After a set of initial cost values for each segment has been stored into a disparity 
space distribution (DSD), iterative relaxation (or loopy belief propagation, in the more recent 
work of Zitnick and Kang (2007)) is used to adjust the disparity estimates for each segment, 
as shown in Figure 12.14. Taguchi, Wilburn, and Zitnick (2008) refine the segment shapes 
as part of the optimization process, which leads to much improved results, as shown in Fig- 
ure 12.15. 
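The local plane fits used by several of these methods can be sketched as an ordinary least-squares fit of d = a x + b y + c to the disparities inside one segment. Plain least squares is an assumption made to keep the example short (the cited methods may use more robust fitting), and the function name and boolean segment mask are hypothetical.

```python
import numpy as np

def fit_disparity_plane(disp, mask):
    """Least-squares plane d = a*x + b*y + c through the disparities where mask is True."""
    ys, xs = np.nonzero(mask)
    design = np.column_stack([xs, ys, np.ones(xs.size)])
    target = np.asarray(disp, dtype=np.float64)[ys, xs]
    (a, b, c), *_ = np.linalg.lstsq(design, target, rcond=None)
    return a, b, c
```

Each segment is then described by three plane parameters instead of one disparity per pixel, which allows slanted surfaces to be represented without staircase artifacts.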

Even more accurate results are obtained by Klaus, Sormann, and Karner (2006), who first 
segment the reference image using mean shift, run a small (3 x 3) SAD plus gradient SAD 
(weighted by cross-checking) to get initial disparity estimates, fit local planes, re-fit with 
global planes, and then run a final MRF on plane assignments with loopy belief propagation. 
Figure 12.16 Multiframe matching using edges, planes, and superpixels (Xue, Owens et 
al. 2019) © 2019 Elsevier. 


When the algorithm was first introduced in 2006, it was the top ranked algorithm on the 
existing Middlebury benchmark. 

The algorithm by Wang and Zheng (2008) follows a similar approach of segmenting the 
image, doing local plane fits, and then performing cooperative optimization of neighboring 
plane fit parameters. The algorithm by Yang, Wang et al. (2009) uses the color correlation 
approach of Yoon and Kweon (2006) and hierarchical belief propagation to obtain an initial 
set of disparity estimates. Gallup, Frahm, and Pollefeys (2010) segment the image into planar 
and non-planar regions and use different representations for these two classes of surfaces. 

More recently, Xue, Owens et al. (2019) start by matching edges across a multi-frame 
stereo sequence and then fit overlapping square patches to obtain local plane hypotheses. 
These are then refined using superpixels and a final edge-aware relaxation to get continuous 
depth maps. 

Another important ability of segmentation-based stereo algorithms, which they share with 
algorithms that use explicit layers (Baker, Szeliski, and Anandan 1998; Szeliski and Golland 
1999) or boundary extraction (Hasinoff, Kang, and Szeliski 2006), is the ability to extract 
fractional pixel alpha mattes at depth discontinuities (Bleyer, Gelautz et al. 2009). This ability 
is crucial when attempting to create virtual view interpolation without clinging boundary 
or tearing artifacts (Zitnick, Kang et al. 2004) and also to seamlessly insert virtual objects 
(Taguchi, Wilburn, and Zitnick 2008), as shown in Figure 12.15b. 


12.5.3 Application: Z-keying and background replacement 


Another application of real-time stereo matching is z-keying, which is the process of seg- 
menting a foreground actor from the background using depth information, usually for the 


purpose of replacing the background with some computer-generated imagery, as shown in 
Figure 12.17 Background replacement using z-keying with a bi-layer segmentation algorithm 
(Kolmogorov, Criminisi et al. 2006) © 2006 IEEE. 


Figure 12.2g. 

Originally, z-keying systems required expensive custom-built hardware to produce the 
desired depth maps in real time and were, therefore, restricted to broadcast studio applications 
(Kanade, Yoshida et al. 1996; Iddan and Yahav 2001). Off-line systems were also developed 
for estimating 3D multi-viewpoint geometry from video streams (Section 14.5.4) (Kanade, 
Rander, and Narayanan 1997; Carranza, Theobalt et al. 2003; Zitnick, Kang et al. 2004; 
Vedula, Baker, and Kanade 2005). Highly accurate real-time stereo matching subsequently 
made it possible to perform z-keying on regular PCs, enabling desktop video conferencing 
applications such as those shown in Figure 12.17 (Kolmogorov, Criminisi et al. 2006), but 
these have mostly been replaced with deep networks for background replacement (Sengupta, 
Jayaram et al. 2020) and real-time 3D phone-based reconstruction algorithms for augmented 
reality (Figure 11.28 and Valentin, Kowdle et al. 2018). 
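
In its simplest form, z-keying reduces to thresholding the depth map and compositing. The sketch below assumes a per-pixel depth map registered to the color image; the function name `z_key` and the optional `feather` parameter (a soft transition in depth units) are illustrative choices and are not taken from the cited systems, which use considerably more sophisticated segmentation.

```python
import numpy as np

def z_key(image, depth, background, z_max, feather=0.0):
    """Composite pixels closer than z_max over a replacement background.

    image, background: (H, W, 3) float arrays.
    depth:             (H, W) float array, same units as z_max.
    feather:           width of a soft alpha ramp around z_max (0 = hard matte).
    Returns the composite and the (H, W) alpha matte.
    """
    if feather > 0:
        # Linear ramp: alpha is 0.5 exactly at depth == z_max.
        alpha = np.clip((z_max - depth) / feather + 0.5, 0.0, 1.0)
    else:
        alpha = (depth < z_max).astype(float)
    a = alpha[..., None]
    return a * image + (1.0 - a) * background, alpha
```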


12.6 Deep neural networks 


As with other areas of computer vision, deep neural networks and end-to-end learning have 
had a large impact on stereo matching. In this section, we briefly review how DNNs have 
been used in stereo correspondence algorithms. We follow the same structure as the two 
recent surveys by Poggi, Tosi et al. (2021) and Laga, Jospin et al. (2020), which classify 
techniques into three categories, namely, 


1. learning in the stereo pipeline, 


2. end-to-end learning with 2D architectures, and 
3. end-to-end learning with 3D architectures. 


We briefly discuss a few papers in each group and refer the reader to the full surveys for more 
details (Janai, Güney et al. 2020; Poggi, Tosi et al. 2021; Laga, Jospin et al. 2020). 


Learning in the stereo pipeline 


Even before the advent of deep learning, several authors proposed learning components of 
the traditional stereo pipeline, e.g., to learn hyperparameters of MRF and CRF stereo models 
(Zhang and Seitz 2007; Pal, Weinman et al. 2012). Zbontar and LeCun (2016) were the 
first to bring deep learning to stereo by training features to optimize a pairwise matching 
cost. These learned matching costs are still widely used in top-performing methods on the 
Middlebury stereo evaluation. Many other authors have since proposed CNNs for matching 
cost computation and aggregation (Luo, Schwing, and Urtasun 2016; Park and Lee 2017; 
Zhang, Prisacariu et al. 2019). 

Learning has also been used to improve traditional optimization techniques, in particular 
the widely used SGM algorithm of Hirschmüller (2008). This includes SGM-Net (Seki and 
Pollefeys 2017), which uses a CNN to adjust transition costs, and SGM-Forest (Schönberger, 
Sinha, and Pollefeys 2018), which uses a random-forest classifier to select among disparity 
values from multiple incident directions. CNNs have also been used in the refinement stage, 
replacing earlier techniques such as bilateral filtering (Gidaris and Komodakis 2017; Batsos 
and Mordohai 2018; Knöbelreiter and Pock 2019). 
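
To make the role of the transition costs concrete, here is a minimal sketch of SGM-style cost aggregation along a single left-to-right path of one scanline. The constant penalties `p1` and `p2` are the quantities that a method such as SGM-Net would instead predict per pixel; their default values here are arbitrary assumptions, and a full SGM implementation sums such aggregated costs over several path directions.

```python
import numpy as np

def aggregate_scanline(cost, p1=1.0, p2=8.0):
    """Aggregate matching costs along one left-to-right path.

    cost: (W, D) array of matching costs C(x, d) for a single scanline.
    p1:   penalty for a disparity change of exactly 1.
    p2:   penalty for a larger disparity jump.
    Returns the (W, D) aggregated path costs.
    """
    w, d = cost.shape
    agg = np.empty((w, d), dtype=float)
    agg[0] = cost[0]
    for x in range(1, w):
        prev = agg[x - 1]
        best_prev = prev.min()
        from_minus = np.concatenate(([np.inf], prev[:-1])) + p1  # came from d - 1
        from_plus = np.concatenate((prev[1:], [np.inf])) + p1    # came from d + 1
        jump = np.full(d, best_prev + p2)                        # any larger jump
        best = np.minimum.reduce([prev, from_minus, from_plus, jump])
        # Subtracting best_prev keeps the values from growing along the path.
        agg[x] = cost[x] + best - best_prev
    return agg
```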


End-to-end learning with 2D architectures 


The availability of large synthetic datasets with ground truth disparities, in particular the 
Freiburg SceneFlow dataset (Mayer, Ilg et al. 2016, 2018), enabled the end-to-end training of 
stereo networks and resulted in a proliferation of new methods. These methods work well on 
benchmarks that provide enough training data so that the network can be tuned to the domain, 
notably KITTI (Geiger, Lenz, and Urtasun 2012; Geiger, Lenz et al. 2013; Menze and Geiger 
2015), where deep learning-based methods started to dominate the leaderboards in 2016. 
The first deep learning architectures for stereo were similar to those designed for dense 
regression tasks such as semantic segmentation (Chen, Zhu et al. 2018). These 2D architec- 
tures typically employ an encoder-decoder design inspired by U-Net (Ronneberger, Fischer, 
and Brox 2015). The first such model was DispNet-C, introduced in the seminal paper by 
Mayer, Ilg et al. (2016), utilizing a correlation layer (Dosovitskiy, Fischer et al. 2015) to 


compute the similarity between image layers. 
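
As a rough illustration of what a correlation layer computes in the rectified stereo case, the sketch below builds a volume of feature similarities over horizontal shifts. It assumes the feature maps have already been produced by some encoder and works on plain NumPy arrays; the function name `correlation_volume` and the zero-filling of out-of-range positions are assumptions made for this example, not details of DispNet-C.

```python
import numpy as np

def correlation_volume(feat_left, feat_right, max_disp):
    """Correlate left and right feature maps over horizontal shifts.

    feat_left, feat_right: (C, H, W) feature maps of a rectified pair.
    Returns a (max_disp + 1, H, W) volume whose entry [d, y, x] is the mean over
    channels of feat_left[:, y, x] * feat_right[:, y, x - d] (0 if x - d < 0).
    """
    c, h, w = feat_left.shape
    volume = np.zeros((max_disp + 1, h, w), dtype=float)
    for d in range(min(max_disp + 1, w)):
        # Left pixel x is compared with right pixel x - d.
        prod = feat_left[:, :, d:] * feat_right[:, :, : w - d]
        volume[d, :, d:] = prod.mean(axis=0)
    return volume
```

The result has a single similarity value per disparity hypothesis, which is what allows the rest of the network to use ordinary 2D convolutions.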
Figure 12.18 Disparity maps computed by three different DNN stereo matchers trained on 
synthetic data and applied to real-world image pairs (Zhang, Qi et al. 2020) © 2020 Springer. 


Subsequent improvements to 2D architectures included the idea of residual networks that 
apply residual corrections to the original disparities (Pang, Sun et al. 2017), which can also 
be done in an iterative fashion (Liang, Feng et al. 2018). Coarse-to-fine processing can be 
used (Tonioni, Tosi et al. 2019; Yin, Darrell, and Yu 2019), and networks can estimate oc- 
clusions and depth boundaries (Ilg, Saikia et al. 2018; Song, Zhao et al. 2020) or use neural 
architecture search (NAS) to improve performance (Saikia, Marrakchi et al. 2019). HITNet 
incorporates several of these ideas and produces efficient state-of-the-art results using local 
slanted plane hypotheses and iterative refinement (Tankovich, Häne et al. 2021). 

The 2D architecture developed by Knöbelreiter, Reinbacher et al. (2017) uses a joint 
CNN and Conditional Random Field (CRF) model to infer dense disparity maps. Another 
promising approach is multi-task learning, for instance, jointly estimating disparities and 
semantic segmentation (Yang, Zhao et al. 2018; Jiang, Sun et al. 2019). It is also possible 
to increase the apparent resolution of the output depth map and reduce over-smoothing by 


representing the output as a bimodal mixture distribution (Tosi, Liao et al. 2021). 


End-to-end learning with 3D architectures 


An alternative approach is to use 3D architectures, which explicitly encode geometry by pro- 
cessing features over a 3D volume, where the third dimension corresponds to the disparity 
search range. In other words, such architectures explicitly represent the disparity space im- 
age (DSI), while still keeping multiple feature channels instead of just scalar cost values. 
Compared to 2D architectures, they incur much higher memory requirements and runtimes. 
The first examples of such architectures include GC-Net (Kendall, Martirosyan et al. 
2017) and PSMNet (Chang and Chen 2018). 3D architectures also allow the integration of 
traditional local aggregation methods (Zhang, Prisacariu et al. 2019) and methods to avoid 
geometric inconsistencies (Chabra, Straub et al. 2019). While resource constraints often 
mean that 3D DNN-based stereo methods operate at fairly low resolutions, the Hierarchical 
Stereo Matching (HSM) network (Yang, Manela et al. 2019) uses a pyramid approach that 
selectively restricts the search space at higher resolutions and enables anytime on-demand inference, 
i.e., stopping the processing early for higher frame rates. Duggal, Wang et al. (2019) 
address limited resources by developing a differentiable version of PatchMatch (Bleyer, Rhe- 
mann, and Rother 2011) in a recurrent neural net. Cheng, Zhong et al. (2020) use neural 
architecture search (NAS) to create a state-of-the-art 3D architecture. 
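
The difference from a scalar cost volume can be seen in a small sketch that keeps the feature channels for every disparity hypothesis by concatenating left features with shifted right features. This is one common way of forming such a volume and is offered as an assumption-laden illustration, not as the exact construction of any cited network; the function name `concat_feature_volume` is hypothetical.

```python
import numpy as np

def concat_feature_volume(feat_left, feat_right, max_disp):
    """Build a 4D feature volume for a rectified pair.

    feat_left, feat_right: (C, H, W) feature maps.
    Returns a (2C, max_disp + 1, H, W) array: for each disparity d, the left
    features are stacked with the right features shifted by d pixels.
    Positions with x - d < 0 are left at zero.
    """
    c, h, w = feat_left.shape
    volume = np.zeros((2 * c, max_disp + 1, h, w), dtype=feat_left.dtype)
    for d in range(min(max_disp + 1, w)):
        volume[:c, d, :, d:] = feat_left[:, :, d:]
        volume[c:, d, :, d:] = feat_right[:, :, : w - d]
    return volume
```

The volume holds 2C(D + 1)HW values instead of (D + 1)HW, which gives a sense of why these architectures need so much more memory and why 3D convolutions over it are slow.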

While supervised deep learning approaches have come to dominate individual benchmarks 
that include dedicated training sets such as KITTI, they do not yet generalize well 
across domains (Zendel et al. 2020). On the Middlebury benchmark, which features high- 
resolution images and only provides very limited training data, deep learning methods are 
still notably absent. Poggi, Tosi et al. (2021) identify the following two major challenges 
that remain open: (1) generalization across different domains, and (2) applicability on high- 
resolution images. For cross-domain generalization, Poggi, Tosi et al. (2021) describe tech- 
niques for both offline and online self-supervised adaptation and guided deep learning, while 
Laga, Jospin et al. (2020) discuss both fine-tuning and data transformation. A recent exam- 
ple of domain generalization is the domain-invariant stereo matching network (DSMNet) of 
Zhang, Qi et al. (2020), which compares favorably with alternative state-of-the-art models 
such as HD³ (Yin, Darrell, and Yu 2019) and PSMNet (Chang and Chen 2018), as shown 
in Figure 12.18. Another example of domain adaptation is AdaStereo (Song, Yang et al. 
2021). For high-resolution images, techniques have been developed to increase resolution in 
a coarse-to-fine manner (Khamis, Fanello et al. 2018; Chabra, Straub et al. 2019). 


12.7 Multi-view stereo 


While matching pairs of images is a useful way of obtaining depth information, using more 
images can significantly improve results. In this section, we review not only techniques for 
creating complete 3D object models, but also simpler techniques for improving the quality 
of depth maps using multiple source images. A good survey of techniques developed up 
through 2015 can be found in Furukawa and Hernández (2015) and a more recent review in 
Janai, Güney et al. (2020, Chapter 10). 

As we saw in our discussion of plane sweep (Section 12.1.2), it is possible to resample 
all neighboring k images at each disparity hypothesis d into a generalized disparity space 


volume $\tilde{I}(x, y, d, k)$. The simplest way to take advantage of these additional images is to sum 
Figure 12.19 Epipolar plane image (EPI) (Gortler, Grzeszczuk et al. 1996) © 1996 ACM 
and a schematic EPI (Kang, Szeliski, and Chai 2001) © 2001 IEEE. (a) The Lumigraph 
(light field) (Section 14.3) is the 4D space of all light rays passing through a volume of space. 
Taking a 2D slice results in all of the light rays embedded in a plane and is equivalent to a 
scanline taken from a stacked EPI volume. Objects at different depths move sideways with 
velocities (slopes) proportional to their inverse depth. Occlusion (and translucency) effects 
can easily be seen in this representation. (b) The EPI corresponding to Figure 12.20 showing 
the three images (middle, left, and right) as slices through the EPI volume. The spatially and 
temporally shifted window around the black pixel is indicated by the rectangle, showing that 


the right image is not being used in matching. 


up their differences from the reference image $I_r$, as in (12.4), 

$$C(x, y, d) = \sum_k \rho\big(\tilde{I}(x, y, d, k) - I_r(x, y)\big). \qquad (12.11)$$


This is the basis of the well-known sum of summed-squared-difference (SSSD) and SSAD 
approaches (Okutomi and Kanade 1993; Kang, Webb et al. 1995), which can be extended 
to reason about likely patterns of occlusion (Nakamura, Matsuura et al. 1996). More recent 
work by Gallup, Frahm et al. (2008) shows how to adapt the baselines used to the expected 
depth to get the best tradeoff between geometric accuracy (wide baseline) and robustness to 
occlusion (narrow baseline). Alternative multi-view cost metrics include measures such as 
synthetic focus sharpness and the entropy of the pixel color distribution (Vaish, Szeliski et al. 
2006). 
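
The summed cost (12.11) is straightforward to evaluate once the neighboring images have been resampled. The sketch below assumes that this resampling has already been done, so that the input is a four-dimensional array indexed by frame, disparity, row, and column; the function names and the default choice of a squared-difference penalty for ρ are illustrative assumptions.

```python
import numpy as np

def sssd_cost(warped, reference, rho=np.square):
    """Sum per-frame matching costs over all neighboring frames.

    warped:    (K, D, H, W) neighbors resampled at each of D disparity hypotheses.
    reference: (H, W) reference image.
    rho:       elementwise penalty (squared difference by default; np.abs gives SSAD).
    Returns the (D, H, W) cost volume C(x, y, d).
    """
    return rho(warped - reference[None, None]).sum(axis=0)

def winner_take_all(cost):
    """Pick, at every pixel, the disparity index with the lowest summed cost."""
    return cost.argmin(axis=0)
```

In practice, the per-pixel costs would be aggregated over a window or passed to a global optimizer before the disparity is selected.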

A useful way to visualize the multi-frame stereo estimation problem is to examine the 
epipolar plane image (EPI) formed by stacking corresponding scanlines from all the images, 
as shown in Figures 9.11c and 12.19 (Bolles, Baker, and Marimont 1987; Baker and Bolles 
1989; Baker 1989). As you can see in Figure 12.19, as a camera translates horizontally (in a 
Figure 12.20 Spatio-temporally shiftable windows (Kang, Szeliski, and Chai 2001) © 2001 
IEEE: A simple three-image sequence (the middle image is the reference image), which has a 
moving frontal gray square (marked F) and a stationary background. Regions B, C, D, and E 
are partially occluded. (a) A regular SSD algorithm will make mistakes when matching pixels 
in these regions (e.g., the window centered on the black pixel in region B) and in windows 
straddling depth discontinuities (the window centered on the white pixel in region F). (b) 
Shiftable windows help mitigate the problems in partially occluded regions and near depth 
discontinuities. The shifted window centered on the white pixel in region F matches correctly 
in all frames. The shifted window centered on the black pixel in region B matches correctly 
in the left image, but requires temporal selection to disable matching the right image. Fig- 
ure 12.19b shows an EPI corresponding to this sequence and describes in more detail how 


temporal selection works. 


standard horizontally rectified geometry), objects at different depths move sideways at a rate 
inversely proportional to their depth (12.1).⁷ Foreground objects occlude background objects, 
which can be seen as EPI-strips (Criminisi, Kang et al. 2005) occluding other strips in the 
EPI. If we are given a dense enough set of images, we can find such strips and reason about 
their relationships to both reconstruct the 3D scene and make inferences about translucent 
objects (Tsin, Kang, and Szeliski 2006) and specular reflections (Swaminathan, Kang et al. 
2002; Criminisi, Kang et al. 2005). Alternatively, we can treat the series of images as a set 
of sequential observations and merge them using Kalman filtering (Matthies, Kanade, and 
Szeliski 1989) or maximum likelihood inference (Cox 1994). 
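
Forming an EPI from an image stack is a one-line slicing operation, and the slope of each strip follows from the usual rectified relation d = f B / Z. The sketch below assumes a camera that translates horizontally by a constant step between frames; the function names and the parameter `baseline_step` are hypothetical.

```python
import numpy as np

def epipolar_plane_image(frames, row):
    """Stack the same scanline from every frame.

    frames: (K, H, W) images from a horizontally translating camera.
    Returns the (K, W) epipolar plane image for the given row.
    """
    return np.asarray(frames)[:, row, :]

def epi_slope(focal, baseline_step, depth):
    """Image shift in pixels per frame for a point at the given depth.

    Uses d = f * B / Z, so nearer points trace steeper lines in the EPI.
    """
    return focal * baseline_step / depth
```

For instance, with a focal length of 500 pixels and a camera step of 0.1 units, a point at depth 5 shifts by about 10 pixels per frame, while a point at depth 10 shifts by about 5.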

When fewer images are available, it becomes necessary to fall back on aggregation techniques, 
such as sliding windows or global optimization. With additional input images, however, 
the likelihood of occlusions increases. It is therefore prudent not only to adjust the best 
window locations using a shiftable window approach, as shown in Figure 12.20a, but also to 
optionally select a subset of neighboring frames to discount those images where the region 
of interest is occluded, as shown in Figure 12.20b (Kang, Szeliski, and Chai 2001). 


⁷ The four-dimensional generalization of the EPI is the light field, which we study in Section 14.3. 
Figure 12.21 Local (5 × 5 window-based) matching results (Kang, Szeliski, and Chai 
2001) © 2001 IEEE: (a) window that is not spatially perturbed (centered); (b) spatially 
perturbed window; (c) using the best five of 10 neighboring frames; (d) using the better half 
sequence. Notice how the results near the tree trunk are improved using temporal selection. 


Figure 12.19b shows how such spatio-temporal selection or shifting of windows corresponds to 
selecting the most likely un-occluded volumetric region in the epipolar plane image volume. 
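
Temporal selection of the kind described here can be approximated by keeping, at each pixel and disparity, only the lowest-cost subset of neighboring frames. The following sketch assumes the per-frame window costs have already been computed; it is a simplified stand-in for the selection rule of Kang, Szeliski, and Chai (2001), and the function name `best_subset_cost` is hypothetical.

```python
import numpy as np

def best_subset_cost(per_frame_cost, n_best):
    """Sum only the n_best smallest per-frame costs at each pixel and disparity.

    per_frame_cost: (K, D, H, W) matching costs against each of K neighbors.
    n_best:         number of frames to keep (e.g., K // 2), with 1 <= n_best <= K.
    Returns a (D, H, W) cost volume that discounts likely occluded frames.
    """
    # np.partition places the n_best smallest values (unordered) first along axis 0.
    smallest = np.partition(per_frame_cost, n_best - 1, axis=0)[:n_best]
    return smallest.sum(axis=0)
```

Frames in which the pixel is occluded tend to produce large costs, so they are the ones dropped from the sum.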

The results of applying these techniques to the multi-frame flower garden image sequence 
are shown in Figure 12.21, which compares the results of using regular (non-shifted) SSSD 
with spatially shifted windows and full spatio-temporal window selection. (The task of apply- 
ing stereo to a rigid scene filmed with a moving camera is sometimes called motion stereo). 
Similar improvements from using spatio-temporal selection are reported by Kang and Szeliski 


(2004) and are evident even when local measurements are combined with global optimization. 


While computing a depth map from multiple inputs outperforms pairwise stereo match- 
ing, even more dramatic improvements can be obtained by estimating multiple depth maps 
simultaneously (Szeliski 1999a; Kang and Szeliski 2004). The existence of multiple depth 
maps enables more accurate reasoning about occlusions, as regions that are occluded in one 
image may be visible (and matchable) in others. The multi-view reconstruction problem can 
be formulated as the simultaneous estimation of depth maps at key frames (Figure 9.11c) 
while maximizing not only photoconsistency and piecewise disparity smoothness, but also 
the consistency between disparity estimates at different frames. While Szeliski (1999a) and 
Kang and Szeliski (2004) use soft (penalty-based) constraints to encourage multiple disparity 
maps to be consistent, Kolmogorov and Zabih (2002) show how such consistency measures 
can be encoded as hard constraints, which guarantee that the multiple depth maps are not only 
similar but actually identical in overlapping regions. Additional algorithms that simultane- 
ously estimate multiple disparity maps include those of Maitre, Shinagawa, and Do (2008) 
and Zhang, Jia et al. (2008) and the widely used COLMAP algorithm (Schönberger, Zheng et 
al. 2016), which uses view selection and geometric consistency between multiple depth maps 
to filter matches, as shown in Figure 12.26b. 


The latest multi-view stereo algorithms use deep neural networks to compute matching 
[Figure panels: PMVS, SurfaceNet, MVSNet, Ground truth] 

Figure 12.22 Depth maps computed using three different multi-view stereo algorithms 
shown as colored point clouds (Yao, Luo et al. 2018) © 2018 Springer. The red boxes indicate 
problem areas where MVSNet does better. 


(cost) volumes and to fuse these into disparity maps. The DeepMVS system computes pair- 
wise matching costs between a reference image and neighboring images and then fuses them 
together with max pooling, followed by a dense CRF refinement (Huang, Matzen et al. 2018). 
MVSNet computes the variance between all encoded images warped onto each sweep plane, 
uses a 3D U-Net to regularize the costs, and then a soft argmin and depth refinement network 
to produce good results on the DTU and Tanks and Temples datasets (Yao, Luo et al. 2018), 
as shown in Figure 12.22. 
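
Two of the ingredients mentioned for MVSNet, the variance across warped views and the soft argmin, are simple enough to sketch in NumPy. The example assumes that feature maps from all views have already been warped onto the sweep planes and that some regularization network (omitted here) has reduced the volume to a scalar cost per depth hypothesis; the function names are hypothetical.

```python
import numpy as np

def variance_cost(warped_features):
    """Per-channel variance across views.

    warped_features: (N, C, D, H, W) features of N views warped onto D sweep planes.
    Returns a (C, D, H, W) volume that is small where the views agree.
    """
    return warped_features.var(axis=0)

def soft_argmin(cost, depth_values):
    """Expected depth under a softmax over negated costs.

    cost:         (D, H, W) regularized cost volume.
    depth_values: (D,) depth of each sweep plane.
    Returns an (H, W) depth map with sub-plane precision.
    """
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # for numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    return np.tensordot(depth_values, prob, axes=(0, 0))
```

Unlike a hard argmin, the soft version is differentiable, which is what allows the depth estimate to be trained end to end.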

More recent variants on such networks include P-MVSNet (Luo, Guan et al. 2019), which 
uses a patch-wise matching confidence aggregator, and CasMVSNet (Gu, Fan et al. 2020) and 
CVP-MVSNet (Yang, Mao et al. 2020), both of which use coarse-to-fine pyramid processing. 
Four even more recent papers that all score well on the DTU, ETH3D, Tanks and Temples, 
and/or Blended MVS datasets are Vis-MVSNet (Zhang, Yao et al. 2020), D²HC-RMVSNet 
(Yan, Wei et al. 2020), DeepC-MVS (Kuhn, Sormann et al. 2020), and PatchmatchNet (Wang, 
Galliani et al. 2021). These algorithms use various combinations of visibility and occlusion 
reasoning, confidence or uncertainty maps, geometric consistency checks, and efficient 
propagation schemes to achieve good results. As so many new multi-view stereo papers 
continue to get published, the ETH3D and Tanks and Temples leaderboards (Table 12.1) are 
good places to look for the latest results. 


12.7.1 Scene flow 


A closely related topic to multi-frame stereo estimation is scene flow, in which multiple cam- 
eras are used to capture a dynamic scene. The task is then to simultaneously recover the 3D 
shape of the object at every instant in time and to estimate the full 3D motion of every surface 


point between frames. Representative papers in this area include Vedula, Baker et al. (2005), 
[Figure panels: left camera image; 3D scene flow with color-encoded vector velocity] 

Figure 12.23 Three-dimensional scene flow: (a) computed from a multi-camera dome surrounding 
the dancer shown in Figure 12.2h-j (Vedula, Baker et al. 2005) © 2005 IEEE; (b) 
computed from stereo cameras mounted on a moving vehicle (Wedel, Rabe et al. 2008) © 
2008 Springer. 


Zhang and Kambhamettu (2003), Pons, Keriven, and Faugeras (2007), Huguet and Devernay 
(2007), Wedel, Rabe et al. (2008), and Rabe, Müller et al. (2010). Figure 12.23a shows an image 
of the 3D scene flow for the tango dancer shown in Figure 12.2h-j, while Figure 12.23b 
shows 3D scene flows captured from a moving vehicle for the purpose of obstacle avoid- 
ance. In addition to supporting mensuration and safety applications, scene flow can be used 
to support both spatial and temporal view interpolation (Section 14.5.4), as demonstrated by 
Vedula, Baker, and Kanade (2005). 
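
For the special case of a rectified stereo rig, the 3D motion of a point can be assembled from two disparity measurements and one optical flow vector. The sketch below makes that construction explicit under the assumptions of a pinhole camera with focal length `focal`, principal point `(cx, cy)`, and stereo baseline `baseline`; it is not the formulation of any of the papers cited above, which estimate these quantities jointly with regularization.

```python
import numpy as np

def backproject(x, y, disparity, focal, baseline, cx, cy):
    """Convert pixel coordinates and disparity to 3D points, using Z = f * B / d."""
    z = focal * baseline / disparity
    return np.stack([(x - cx) * z / focal, (y - cy) * z / focal, z], axis=-1)

def scene_flow(x, y, disp_t0, disp_t1, flow_u, flow_v, focal, baseline, cx, cy):
    """3D displacement of points between two time instants.

    (x, y), disp_t0:  pixel position and disparity at time t0.
    (flow_u, flow_v): optical flow from t0 to t1 in the reference camera.
    disp_t1:          disparity at the flowed position (x + u, y + v) at time t1.
    """
    p0 = backproject(x, y, disp_t0, focal, baseline, cx, cy)
    p1 = backproject(x + flow_u, y + flow_v, disp_t1, focal, baseline, cx, cy)
    return p1 - p0
```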

The creation of the KITTI scene flow dataset (Geiger, Lenz, and Urtasun 2012) as well 
as increased interest in autonomous driving have accelerated research into scene flow algo- 
rithms (Janai, Güney et al. 2020, Chapter 12). One way to help regularize the problem is 
to adopt a piecewise planar representation (Vogel, Schindler, and Roth 2015). Another is 
to decompose the scene into rigid separately moving objects such as vehicles (Menze and 
Geiger 2015), using semantic segmentation (Behl, Hosseini Jafari et al. 2017), as well as to 
use other segmentation cues (Ilg, Saikia et al. 2018; Ma, Wang et al. 2019; Jiang, Sun et al. 
2019). The more widespread availability of 3D sensors has enabled the extension of scene 
flow algorithms to use this modality as an additional input (Sun, Sudderth, and Pfister 2015; 
Behl, Paschalidou et al. 2019). 


12.7.2 Volumetric and 3D surface reconstruction 


The most challenging but also most useful variant of multi-view stereo reconstruction is the 
construction of globally consistent 3D models (Seitz, Curless et al. 2006). This topic has a 
long history in computer vision, starting with surface mesh reconstruction techniques such 
as the one developed by Fua and Leclerc (1995) (Figure 12.24a). A variety of approaches 


and representations have been used to solve this problem, including 3D voxels (Seitz and 
Figure 12.24 Multi-view stereo algorithms: (a) surface-based stereo (Fua and Leclerc 
1995) © 1995 Springer; (b) voxel coloring (Seitz and Dyer 1999) © 1999 Springer; (c) depth 
map merging (Narayanan, Rander, and Kanade 1998) © 1998 IEEE; (d) level set evolution 
(Faugeras and Keriven 1998) © 1998 IEEE; (e) silhouette and stereo fusion (Hernández and 
Schmitt 2004) © 2004 Elsevier; (f) multi-view image matching (Pons, Keriven, and Faugeras 
2005) © 2005 IEEE; (g) volumetric graph cut (Vogiatzis, Torr, and Cipolla 2005) © 2005 
IEEE; (h) carved visual hulls (Furukawa and Ponce 2009) © 2009 Springer. 


Dyer 1999; Szeliski and Golland 1999; De Bonet and Viola 1999; Kutulakos and Seitz 2000; 
Eisert, Steinbach, and Girod 2000; Slabaugh, Culbertson et al. 2004; Sinha and Pollefeys 
2005; Vogiatzis, Hernández et al. 2007; Hiep, Keriven et al. 2009), level sets (Faugeras and 
Keriven 1998; Pons, Keriven, and Faugeras 2007), polygonal meshes (Fua and Leclerc 1995; 
Narayanan, Rander, and Kanade 1998; Hernández and Schmitt 2004; Furukawa and Ponce 
2009), and multiple depth maps (Kolmogorov and Zabih 2002). Figure 12.24 shows repre- 


sentative examples of 3D object models reconstructed using some of these techniques. 


To organize and compare all these techniques, Seitz, Curless et al. (2006) developed a 


six-point taxonomy that can help classify algorithms according to the scene representation, 
Figure 12.25 Multi-view stereo (a) scene representations and (b) processing pipelines, 
from Furukawa and Hernández (2015) © 2015 now publishers. 


photoconsistency measure, visibility model, shape priors, reconstruction algorithm, and ini- 
tialization requirements they use. Below, we summarize some of these choices and list a few 
representative papers. For more details, please consult the full survey paper (Seitz, Curless 
et al. 2006) as well as more recent surveys by Furukawa and Hernández (2015) and Janai, Güney 
et al. (2020, Chapter 10). The ETH3D and Tanks and Temples leaderboards list the most 
up-to-date results and pointers to recent papers. 


Scene representation. According to the taxonomy proposed by Furukawa and Hernández (2015), 
multi-view stereo algorithms primarily use four scene representations, namely depth maps, 
point clouds, volumetric fields, and 3D meshes, as shown in Figure 12.25a. These are often 
combined into a complete pipeline that includes camera pose estimation, per-image depth 
map or point cloud computation, volumetric fusion, and surface mesh refinement (Pollefeys, 
Nistér et al. 2008), as shown in Figure 12.25b. 

We have already discussed multi-view depth map estimation earlier in this section. An ex- 
ample of a point cloud representation is the patch-based multi-view stereo (PMVS) algorithm 
developed by Furukawa and Ponce (2010), which starts with sparse 3D points reconstructed 
Figure 12.26 Point cloud reconstruction: (a) the PMVS pipeline, showing a sample input 
image, detected features, initial reconstructed patches, patches after expansion and filtering, 
and the final mesh model (Furukawa and Ponce 2010) © 2010 IEEE; (b) depth maps and 
surface normals from two stages of the COLMAP multi-view stereo pipeline (Schönberger, 
Zheng et al. 2016) © 2016 Springer; (c) thin structures recovered from gradients in a dense 
orbiting camera light field (Yücer, Kim et al. 2016) © 2016 IEEE. 


using structure from motion, then optimizes and densifies local oriented patches or surfels 
(Szeliski and Tonnesen 1992; Section 13.4) while taking into account visibility constraints, 
as shown in Figure 12.26a. This representation can then be turned into a mesh for final 
optimization. Since then, improved techniques have been developed for view selection and 
filtering as well as normal estimation, as exemplified in the systems developed by Fuhrmann 
and Goesele (2014) and Schönberger, Zheng et al. (2016), the latter of which (shown in 
Figures 11.20b and 12.26b) provides the dense multi-view stereo component of the popular 
COLMAP open-source reconstruction system (Schönberger and Frahm 2016). When highly 
sampled video sequences are available, reconstructing point clouds from tracked edges may 
be more appropriate, as discussed in Section 12.2.1 and by Kim, Zimmer et al. (2013) and 
Yücer, Kim et al. (2016), and as illustrated in Figure 12.26c. 
One of the more popular 3D representations is a uniform grid of 3D voxels,⁸ which can be 
reconstructed using a variety of carving techniques (Seitz and Dyer 1999; Kutulakos and Seitz 
2000) or optimization (Sinha and Pollefeys 2005; Vogiatzis, Hernández et al. 2007; Hiep, 
Keriven et al. 2009; Jancosek and Pajdla 2011; Vu, Labatut et al. 2012), some of which are 
illustrated in Figure 12.24. Level set techniques (Section 7.3.2) also operate on a uniform grid 
but, instead of representing a binary occupancy map, they represent the signed distance to the 
surface (Faugeras and Keriven 1998; Pons, Keriven, and Faugeras 2007), which can encode 
a finer level of detail and also be used to merge multiple point clouds or range data scans, as 
discussed extensively in Section 13.2.1. Instead of using a uniformly sampled volume, which 
works best for compact 3D objects, it is also possible to create a view frustum corresponding 
to one of the input images and to sample the z dimension as inverse depths, i.e., uniform 
disparities for a set of co-planar cameras (Figure 14.7). This kind of representation is called 
a stack of acetates in Szeliski and Golland (1999) and multiplane images in Zhou, Tucker et 
al. (2018). 
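
Sampling the z dimension uniformly in inverse depth is easy to state in code. The short sketch below returns plane depths whose reciprocals are evenly spaced between a near and a far limit; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def inverse_depth_planes(z_near, z_far, num_planes):
    """Depths of sweep planes spaced uniformly in inverse depth (disparity).

    Planes are dense near the camera and sparse far away, so that each step
    corresponds to the same image-space shift for co-planar cameras.
    """
    inv = np.linspace(1.0 / z_near, 1.0 / z_far, num_planes)
    return 1.0 / inv
```

With a near limit of 1, a far limit of 10, and five planes, the depths are approximately 1, 1.29, 1.82, 3.08, and 10, which shows how quickly the spacing grows with distance.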


Polygonal meshes are another popular representation (Fua and Leclerc 1995; Narayanan, 
Rander, and Kanade 1998; Isidoro and Sclaroff 2003; Hernández and Schmitt 2004; Furukawa 
and Ponce 2009; Hiep, Keriven et al. 2009). Meshes are the standard representation 
used in computer graphics and also readily support the computation of visibility and occlu- 
sions. Finally, as we discussed in the previous section, multiple depth maps can also be used 
(Szeliski 1999a; Kolmogorov and Zabih 2002; Kang and Szeliski 2004). Many algorithms 
also use more than a single representation, e.g., they may start by computing multiple depth 
maps and then merge them into a 3D object model (Narayanan, Rander, and Kanade 1998; 
Furukawa and Ponce 2009; Goesele, Curless, and Seitz 2006; Goesele, Snavely et al. 2007; 
Pollefeys, Nistér et al. 2008; Furukawa, Curless et al. 2010; Furukawa and Ponce 2010; Vu, 
Labatut et al. 2012; Schönberger, Zheng et al. 2016), as illustrated in Figure 12.25b. 


An example of a recent system that combines several representations into a scalable dis- 
tributed approach that can handle datasets with hundreds of high-resolution images is the 
LTVRE multi-view stereo system by Kuhn, Hirschmüller et al. (2017). The system starts 
from pairwise disparity maps computed with SGM (Hirschmüller 2008). These depth estimates 
are fused with a probabilistic multi-scale approach using a learned stereo error model, 
using an octree to handle variable resolution, followed by filtering of conflicting points based 
on visibility constraints, and finally triangulation. Figure 12.27 shows an illustration of the 


processing pipeline. 


⁸ For outdoor scenes that go to infinity, an inverted gridding of space may be preferable (Slabaugh, Culbertson et 
al. 2004; Zhang, Riegler et al. 2020). 
Figure 12.27 3D reconstruction pipeline from Kuhn, Hirschmüller et al. (2017) © 2017 
Springer: (0) structure from motion; (1) stereo matching using semi-global matching; (2) 
depth quality estimation; (3) probabilistic space occupancy; (4+5) probabilistic point opti- 
mization and outlier filtering; (6) triangulation. The images in (4+5) and (6) are half texture- 
mapped and half flat shaded to show more surface detail. 


Photoconsistency measure. As we discussed in Section 12.3.1, a variety of similarity 
measures can be used to compare pixel values in different images, including measures that 
try to discount illumination effects or be less sensitive to outliers. In multi-view stereo, algo- 
rithms have a choice of computing these measures directly on the surface of the model, i.e., 
in scene space, or projecting pixel values from one image (or from a textured model) back 
into another image, i.e., in image space. (The latter corresponds more closely to a Bayesian 
approach, because input images are noisy measurements of the colored 3D model.) The ge- 
ometry of the object, i.e., its distance to each camera and its local surface normal, when 
available, can be used to adjust the matching windows used in the computation to account for 
foreshortening and scale change (Goesele, Snavely et al. 2007). 
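
One widely used measure that discounts illumination effects is normalized cross-correlation (NCC), which is invariant to a gain and bias change between the two patches. The sketch below is a minimal version for two equally sized patches; the small constant `eps`, added to avoid division by zero in textureless patches, is an assumption of this example.

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1].

    Subtracting the means removes a bias (offset) difference, and dividing by
    the norms removes a gain (scale) difference between the patches.
    """
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

A patch compared with a brightened and contrast-stretched copy of itself therefore scores close to 1, whereas a sum of squared differences would report a large mismatch.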


Visibility model. A big advantage that multi-view stereo algorithms have over single-depth- 
map approaches is their ability to reason in a principled manner about visibility and occlu- 
sions. Techniques that use the current state of the 3D model to predict which surface pixels 
are visible in each image (Kutulakos and Seitz 2000; Faugeras and Keriven 1998; Vogiatzis, 
Hernández et al. 2007; Hiep, Keriven et al. 2009; Furukawa and Ponce 2010; Schönberger,
Zheng et al. 2016) are classified as using geometric visibility models in the taxonomy of Seitz, 
Curless et al. (2006). Techniques that select a neighboring subset of images to match are called
quasi-geometric (Narayanan, Rander, and Kanade 1998; Kang and Szeliski 2004; Hernández 
and Schmitt 2004), while techniques that use traditional robust similarity measures are called 
outlier-based. While full geometric reasoning is the most principled and accurate approach, 
it can be very slow to evaluate and depends on the evolving quality of the current surface 
estimate to predict visibility, which can be a bit of a chicken-and-egg problem, unless conservative
assumptions are used, as they are by Kutulakos and Seitz (2000).
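
The two cheaper alternatives to full geometric reasoning can be sketched in a few lines. This
is only a toy illustration under stated assumptions: the function names are hypothetical,
"neighboring" is taken to mean the cameras whose centers are closest to the reference
camera, and the robust measure is a simple truncation. Published systems use more
careful selection criteria and cost functions.

```python
import math


def select_neighbor_views(ref_index, centers, k):
    """Subset selection: indices of the k cameras closest to the reference camera."""
    ref = centers[ref_index]
    others = [i for i in range(len(centers)) if i != ref_index]
    others.sort(key=lambda i: math.dist(ref, centers[i]))
    return others[:k]


def robust_cost(per_view_costs, truncation):
    """Outlier-based: truncate each view's cost so occluded views cannot dominate."""
    return sum(min(c, truncation) for c in per_view_costs)
```

Neither function needs a current surface estimate, which is why such schemes avoid the
chicken-and-egg problem at the price of less principled visibility handling.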


Shape priors. Because stereo matching is often underconstrained, especially in texture- 
less regions, most matching algorithms adopt (either explicitly or implicitly) some form of 
prior model for the expected shape. Many of the techniques that rely on optimization use a 
3D smoothness or area-based photoconsistency constraint, which, because of the natural ten- 
dency of smooth surfaces to shrink inwards, often results in a minimal surface prior (Faugeras 
and Keriven 1998; Sinha and Pollefeys 2005; Vogiatzis, Hernández et al. 2007). Approaches 
that carve away the volume of space often stop once a photoconsistent solution is found (Seitz 
and Dyer 1999; Kutulakos and Seitz 2000), which corresponds to a maximal surface bias, i.e.,
these techniques tend to over-estimate the true shape. Finally, multiple depth map approaches 
often adopt traditional image-based smoothness (regularization) constraints. 

Higher-level shape priors can also be used, such as Manhattan world assumptions that 
assume most surfaces of interest are axis-aligned (Furukawa, Curless et al. 2009a,b) or at 
related orientations such as slanted roofs (Sinha, Steedly, and Szeliski 2009; Osman Ulusoy, 
Black, and Geiger 2017). These kinds of architectural priors are discussed in more detail in 
Section 13.6.1. It is also possible to use 2D semantic segmentation in images, e.g., into wall, 
ground, and foliage classes, to apply different kinds of regularization and surface normal 
priors in different regions of the model (Häne, Zach et al. 2013). 


Reconstruction algorithm. The details of how the actual reconstruction algorithm proceeds
are where the largest variety (and greatest innovation) in multi-view stereo algorithms
can be found.

Some approaches use global optimization defined over a three-dimensional photoconsis- 
tency volume to recover a complete surface. Approaches based on graph cuts use polynomial 
complexity binary segmentation algorithms to recover the object model defined on the voxel 
grid (Sinha and Pollefeys 2005; Vogiatzis, Hernández et al. 2007; Hiep, Keriven et al. 2009;
Jancosek and Pajdla 2011; Vu, Labatut et al. 2012). Level set approaches use a continuous 
surface evolution to find a good minimum in the configuration space of potential surfaces and
therefore require a reasonably good initialization (Faugeras and Keriven 1998; Pons, Keriven,
and Faugeras 2007). For the photoconsistency volume to be meaningful, matching costs need
to be computed in some robust fashion, e.g., using sets of limited views or by aggregating
multiple depth maps.

Figure 12.28 Voxel coloring (Seitz and Dyer 1999) © 1999 Springer and space carving
(Kutulakos and Seitz 2000) © 2000 Springer. (a–b): voxel coloring sweeps a plane through
the scene in a front-to-back manner with respect to the cameras. (c–d): space carving uses
multiple sweep directions to deal with more general camera configurations.

An alternative approach to global optimization is to sweep through the 3D volume while 
computing both photoconsistency and visibility simultaneously. The voxel coloring algorithm 
of Seitz and Dyer (1999) performs a front-to-back plane sweep. On every plane, any voxels 
that are sufficiently photoconsistent are labeled as part of the object. The corresponding 
pixels in the source images can then be “erased”, as they are already accounted for, and 
therefore do not contribute to further photoconsistency computations. (A similar approach, 
albeit without the front-to-back sweep order, is used by Szeliski and Golland (1999).) The 
resulting 3D volume, under noise- and resampling-free conditions, is guaranteed both to be
photoconsistent and to enclose whatever true 3D object model generated the
images (Figure 12.28a–b).

Unfortunately, voxel coloring is only guaranteed to work if all of the cameras lie on the 
same side of the sweep planes, which is not possible in general ring configurations of cameras. 
Kutulakos and Seitz (2000) generalize voxel coloring to space carving, where subsets of 
cameras that satisfy the voxel coloring constraint are iteratively selected and the 3D voxel 
grid is alternately carved away along different axes (Figure 12.28c–d).

Another popular approach to multi-view stereo is to first independently compute multiple 
depth maps and then merge these partial maps into a complete 3D model. Approaches to 
depth map merging, which are discussed in more detail in Section 13.2.1, include signed 
distance functions (Curless and Levoy 1996), used by Goesele, Curless, and Seitz (2006), 
and Poisson surface reconstruction (Kazhdan, Bolitho, and Hoppe 2006), used by Goesele, 
Snavely et al. (2007).
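
To give a flavor of merging with signed distance functions, the sketch below fuses several
depth observations along a single viewing ray by averaging truncated signed distances and
then locating the zero crossing. It is a one-dimensional simplification written for this
example: the function names, the constant per-observation weights, and the sign convention
(positive in front of the observed surface) are all assumptions, and a real system operates
on a 3D grid as described in Section 13.2.1.

```python
def fuse_depths_along_ray(depths, weights, z_samples, truncation):
    """Weighted average of truncated signed distances at each sample along a ray.

    The signed distance of sample z to an observation is (depth - z), i.e.,
    positive in front of the surface. Weights are assumed to be positive.
    """
    fused = []
    for z in z_samples:
        num = 0.0
        den = 0.0
        for depth, w in zip(depths, weights):
            sd = max(-truncation, min(truncation, depth - z))
            num += w * sd
            den += w
        fused.append(num / den)
    return fused


def zero_crossing(z_samples, values):
    """First change from positive to non-positive, linearly interpolated."""
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if a > 0.0 >= b:
            t = a / (a - b)
            return z_samples[i] + t * (z_samples[i + 1] - z_samples[i])
    return None  # no surface found along this ray
```

With two equally weighted observations at depths 2.0 and 2.2, the fused surface lands
midway between them, as long as the truncation band is wider than their disagreement.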


Initialization requirements. One final element discussed by Seitz, Curless et al. (2006) 
is the varying degrees of initialization required by different algorithms. Because some algo- 
rithms refine or evolve a rough 3D model, they require a reasonably accurate (or overcom- 
plete) initial model, which can often be obtained by reconstructing a volume from object 
silhouettes, as discussed in Section 12.7.3. However, if the algorithm performs a global 
optimization (Kolev, Klodt et al. 2009; Kolev and Cremers 2009), this dependence on initial- 
ization is not an issue. 


Empirical evaluation. The widespread adoption of datasets and benchmarks has led to the 
rapid advances in multi-view reconstruction over the last two decades. Table 12.1 lists some 
of the most widely used and influential ones, with sample images and/or results shown in 
Figures 12.1, 12.22, and 12.26. Pointers to additional datasets can be found in Mayer, Ilg 
et al. (2018), Janai, Güney et al. (2020), Laga, Jospin et al. (2020), and Poggi, Tosi et al. 
(2021). Pointers to the most recent algorithms can usually be found on the leaderboards of 
the ETH3D and Tanks and Temples benchmarks. 


12.7.3 Shape from silhouettes 


In many situations, performing a foreground–background segmentation of the object of interest
is a good way to initialize or fit a 3D model (Grauman, Shakhnarovich, and Darrell
2003; Vlasic, Baran et al. 2008) or to impose a convex set of constraints on multi-view stereo 
(Kolev and Cremers 2008). Over the years, a number of techniques have been developed to 
reconstruct a 3D volumetric model from the intersection of the binary silhouettes projected 
into 3D. The resulting model is called a visual hull (or sometimes a line hull), analogous with 
the convex hull of a set of points, because the volume is maximal with respect to the visual 
silhouettes and surface elements are tangent to the viewing rays (lines) along the silhouette 
boundaries (Laurentini 1994). It is also possible to carve away a more accurate reconstruction 
using multi-view stereo (Sinha and Pollefeys 2005) or by analyzing cast shadows (Savarese, 
Andreetto et al. 2007). 
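
The intersection of projected silhouettes can be illustrated on a tiny voxel grid. The sketch
below assumes three orthographic views along the coordinate axes, which is a simplification
chosen for brevity (real systems use calibrated perspective cameras), and the function names
are hypothetical.

```python
from itertools import product


def silhouettes(voxels):
    """Orthographic silhouettes of a voxel set along the z, y, and x axes."""
    along_z = {(x, y) for x, y, z in voxels}
    along_y = {(x, z) for x, y, z in voxels}
    along_x = {(y, z) for x, y, z in voxels}
    return along_z, along_y, along_x


def visual_hull(sil_z, sil_y, sil_x, size):
    """Keep every voxel whose projection lies inside all three silhouettes."""
    return {
        (x, y, z)
        for x, y, z in product(range(size), repeat=3)
        if (x, y) in sil_z and (x, z) in sil_y and (y, z) in sil_x
    }
```

For the four-voxel object {(0,0,0), (1,1,0), (1,0,1), (0,1,1)} in a 2 x 2 x 2 grid, all three
silhouettes are completely filled, so the hull contains all eight voxels. This shows the
maximal nature of the visual hull: it always encloses the object and reproduces its
silhouettes, but it can be considerably larger than the true shape.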

Some techniques first approximate each silhouette with a polygonal representation and 
then intersect the resulting faceted conical regions in three-space to produce polyhedral mod- 
els (Baumgart 1974; Martin and Aggarwal 1983; Matusik, Buehler, and McMillan 2001), 
which can later be refined using triangular splines (Sullivan and Ponce 1998). Other ap- 
proaches use voxel-based representations, usually encoded as octrees (Samet 1989), because 
of the resulting space-time efficiency. Figures 12.29a–b show an example of a 3D octree
model and its associated colored tree, where black nodes are interior to the model, white
nodes are exterior, and gray nodes are of mixed occupancy. Examples of octree-based reconstruction
approaches include Potmesil (1987), Noborio, Fukada, and Arimoto (1988),
Srivasan, Liang, and Hackwood (1990), and Szeliski (1993).

Figure 12.29 Volumetric octree reconstruction from binary silhouettes (Szeliski 1993) ©
1993 Elsevier: (a) octree representation and its corresponding (b) tree structure; (c) input
image of an object on a turntable; (d) computed 3D volumetric octree model.

The approach of Szeliski (1993) first converts each binary silhouette into a one-sided 
variant of a distance map, where each pixel in the map indicates the largest square that is 
completely inside (or outside) the silhouette. This makes it fast to project an octree cell 
into the silhouette to confirm whether it is completely inside or outside the object, so that it 
can be colored black, or white, or left as gray (mixed) for further refinement on a smaller 
grid. The octree construction algorithm proceeds in a coarse-to-fine manner, first building an
octree at a relatively coarse resolution, and then refining it by revisiting all the input
images and subdividing the gray (mixed) cells whose occupancy has not yet been determined.
Figure 12.29d shows the resulting octree model computed from a coffee cup rotating on a 
turntable. 
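
The idea of a map that stores the largest square inside (or outside) a silhouette can be
sketched with a standard dynamic program. The anchoring convention used here (each entry
describes the largest square whose bottom-right corner is at that pixel) and the function
names are assumptions made for this example; they are not taken from the original paper.

```python
def largest_square_map(mask):
    """size[y][x] = side of the largest all-true square with bottom-right corner (y, x)."""
    h, w = len(mask), len(mask[0])
    size = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                if y == 0 or x == 0:
                    size[y][x] = 1
                else:
                    size[y][x] = 1 + min(
                        size[y - 1][x], size[y][x - 1], size[y - 1][x - 1]
                    )
    return size


def classify_square(inside_map, outside_map, y, x, side):
    """Test a square footprint (bottom-right corner at (y, x)) with one lookup per map."""
    if inside_map[y][x] >= side:
        return "inside"
    if outside_map[y][x] >= side:
        return "outside"
    return "mixed"
```

In a full reconstruction, a cell whose bounding square is outside any one silhouette can be
colored white, a cell that is inside every silhouette can be colored black, and the rest stay
gray and are subdivided.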

More recent work on visual hull computation borrows ideas from image-based rendering, 
and is hence called an image-based visual hull (Matusik, Buehler et al. 2000). Instead of 
precomputing a global 3D model, an image-based visual hull is recomputed for each new
viewpoint, by successively intersecting viewing ray segments with the binary silhouettes in
each image. This not only leads to a fast computation algorithm but also enables fast texturing
of the recovered model with color values from the input images. This approach can also
be combined with high-quality deformable templates to capture and re-animate whole body
motion (Vlasic, Baran et al. 2008).

Figure 12.30 Monocular depth inference (shown as color-coded normal maps) from images
in the NYU Depth Dataset V2 (Wang, Geraghty et al. 2020) © 2020 IEEE. Panels: (a) image;
(b) GT; (c) Eigen et al. [6]; (d) GeoNet [24]; (e) FrameNet [12]; (f) vp-line model.

While the methods described above start with a binary foreground/background silhouette 
image, it is also possible to extract silhouette curves, usually to sub-pixel precision, and 
to reconstruct partial surface meshes from tracking these, as discussed in Section 12.2.1. 
Such silhouette curves can also be combined with regular image edges to construct complete 
surface models (Yücer, Kim et al. 2016), such as the ones shown in Figure 12.26c.


12.8 Monocular depth estimation 


The ability to infer (or hallucinate?) depth maps from single images opens up all kinds of 
creative possibilities, such as displaying them in 3D (Figure 6.41 and Kopf, Matzen et al. 
2020), creating soft focus effects (Section 10.3.2), and potentially aiding scene understanding.
It can also be used in robotics applications such as autonomous navigation (Figure 12.31), al- 
though most (autonomous and regular) vehicles have more than one camera or range sensors, 
if equipped with computer vision. 

We already saw in Section 6.4.4 how the automatic photo pop-up system can use image 
segmentation and classification to create “cardboard cut-out” versions of a photo (Hoiem, 
Efros, and Hebert 2005a; Saxena, Sun, and Ng 2009). More recent systems to infer depth 
from single images use deep neural networks. These are described in two recent surveys 
(Poggi, Tosi et al. 2021, Section 7; Zhao, Sun et al. 2020), which discuss 20 and over 50
related techniques, respectively, and benchmark them on the KITTI dataset (Geiger, Lenz,
and Urtasun 2012) shown in Figure 12.31.

Figure 12.31 Monocular depth map estimates from images in the KITTI dataset (Godard,
Mac Aodha, and Brostow 2017) © 2017 IEEE. Panels: Input; GT; Eigen et al. [10];
Liu et al. [36]; Garg et al. [16]; Ours.

One of the first papers to use a neural network to compute a depth map was the system 
developed by Eigen, Puhrsch, and Fergus (2014), which was subsequently extended to also 
infer surface normals and semantic labels (Eigen and Fergus 2015). These systems were 
trained and tested on the NYU Depth Dataset V2 (Silberman, Hoiem et al. 2012), shown in 
Figure 12.30, and the KITTI dataset. Most of the subsequent work in this area trains and tests 
on these two fairly restricted datasets (indoor apartments or outdoor street scenes), although 
authors sometimes use Make3D (Saxena, Sun, and Ng 2009) or Cityscapes (Cordts, Omran et 
al. 2016), which are both outdoor street scenes, or ScanNet (Dai, Chang et al. 2017), which 
has indoor scenes. The danger in training and testing on such “closed world” datasets is 
that the network can learn shortcuts, such as inferring depth based on the location along the 
ground plane or failing to “pop up” objects that are not in commonly occurring classes (van 
Dijk and de Croon 2019). 

Early systems were trained on the depth images that came with datasets such as NYU 
Depth, KITTI, and ScanNet, where it turns out that adding soft constraints such as co-planarity
can improve performance (Wang, Shen et al. 2016; Yin, Liu et al. 2019). Later,
“unsupervised” techniques were introduced that use photometric consistency between warped 
stereo pairs of images (Godard, Mac Aodha, and Brostow 2017; Xian, Shen et al. 2018) or 
image pairs in video sequences (Zhou, Brown et al. 2017). It is also possible to train on 3D 
reconstructions of famous landmarks (Li and Snavely 2018), image sets containing people 
posing in a “Mannequin Challenge” (Li, Dekel et al. 2019), or to take more diverse “images
in the wild” and have them labeled with relative depths (Chen, Fu et al. 2016). 
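
The photometric consistency idea behind such “unsupervised” training can be sketched on a
single rectified scanline: warp the right image toward the left one using the predicted
disparities and measure how well the two agree. The function names, the convention that
left pixel x matches right pixel x - d, and the plain mean absolute difference are
assumptions for this example; published systems use considerably more elaborate losses
and differentiable warping inside a network.

```python
import math


def warp_right_to_left(right, disparity):
    """Resample the right scanline at x - d(x) using linear interpolation.

    Entries are None where the source location falls outside the image.
    """
    n = len(right)
    warped = []
    for x, d in enumerate(disparity):
        src = x - d
        if src < 0 or src > n - 1:
            warped.append(None)
            continue
        i0 = int(math.floor(src))
        i1 = min(i0 + 1, n - 1)
        t = src - i0
        warped.append((1 - t) * right[i0] + t * right[i1])
    return warped


def photometric_loss(left, right, disparity):
    """Mean absolute difference between the left scanline and the warped right one."""
    warped = warp_right_to_left(right, disparity)
    diffs = [abs(l - w) for l, w in zip(left, warped) if w is not None]
    return sum(diffs) / len(diffs) if diffs else 0.0
```

A disparity estimate that explains the image pair yields a low loss, so the loss can serve
as a training signal without any ground truth depth.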

A recent paper that federates several such datasets is the MiDaS system developed by 
Ranftl, Lasinger et al. (2020), who not only use a number of existing “in the wild” datasets 
to train a network based on Xian, Shen et al. (2018), but also use thousands of stereo image 
pairs from over a dozen 3D movies as additional training, validation, and test data. In their 
paper, they not only show that their system produces superior results to previous approaches
(Figure 12.32), but also argue that their zero-shot cross-dataset transfer protocol, i.e., testing
on data sets separate from training sets, rather than using random train and test subsets,
produces a system that works better on real-world images and avoids unintended dataset bias
(Torralba and Efros 2011).

Figure 12.32 Monocular depth map estimates and novel views from images in the COCO
dataset (Ranftl, Lasinger et al. 2020) © 2020 IEEE.


An alternative to inferring depth maps from single images is to infer complete closed 3D 
shapes, using either volumetric (Choy, Xu et al. 2016; Girdhar, Fouhey et al. 2016; Tulsiani, 
Gupta et al. 2018) or mesh-based (Gkioxari, Malik, and Johnson 2019) representations (Han, 
Laga, and Bennamoun 2021). Instead of applying deep networks to just a single color image, 
it is also possible to augment such networks with additional cues and representations, such 
as oriented lines and planes (Wang, Geraghty et al. 2020), which serve as higher-level shape
priors (Sections 12.7.2 and 13.6.1). Neural rendering can also be used to create novel views 
(Tucker and Snavely 2020; Wiles, Gkioxari et al. 2020; Figure 14.22d), and to make the 
monocular depth predictions consistent over time (Luo, Huang et al. 2020; Teed and Deng 
2020a; Kopf, Rong, and Huang 2021). An example of a consumer application of monocular 
depth inference is One Shot 3D Photography (Kopf, Matzen et al. 2020), where the system, 
implemented on a mobile phone using a compact and efficient DNN, first infers a depth map, 
then converts this to multiple layers, inpaints the background, creates a mesh and texture atlas,
and then provides real-time interactive viewing on the phone, as shown in Figure 14.10c.


12.9 Additional reading


The field of stereo correspondence and depth estimation is one of the oldest and most widely 
studied topics in computer vision. A number of good surveys have been written over the years 
(Marr and Poggio 1976; Barnard and Fischler 1982; Dhond and Aggarwal 1989; Scharstein 
and Szeliski 2002; Brown, Burschka, and Hager 2003; Seitz, Curless et al. 2006; Furukawa 
and Hernández 2015; Janai, Güney et al. 2020; Laga, Jospin et al. 2020; Poggi, Tosi et al.
2021) and they can serve as good guides to this extensive literature. 


Because of computational limitations and the desire to find appearance-invariant cor- 
respondences, early algorithms often focused on finding sparse correspondences (Hannah 
1974; Marr and Poggio 1979; Mayhew and Frisby 1980; Ohta and Kanade 1985; Bolles, 
Baker, and Marimont 1987; Matthies, Kanade, and Szeliski 1989). 

The topic of computing epipolar geometry and pre-rectifying images is covered in Sec- 
tions 11.3 and 12.1 and is also treated in textbooks on multi-view geometry (Faugeras and 
Luong 2001; Hartley and Zisserman 2004) and articles specifically on this topic (Torr and 
Murray 1997; Zhang 1998a,b). The concepts of the disparity space and disparity space im- 
age are often associated with the seminal work by Marr (1982) and the papers of Yang, Yuille, 
and Lu (1993) and Intille and Bobick (1994). The plane sweep algorithm was first popular- 
ized by Collins (1996) and then generalized to a full arbitrary projective setting by Szeliski 
and Golland (1999) and Saito and Kanade (1999). Plane sweeps can also be formulated using 
cylindrical surfaces (Ishiguro, Yamamoto, and Tsuji 1992; Kang and Szeliski 1997; Shum 
and Szeliski 1999; Li, Shum et al. 2004; Zheng, Kang et al. 2007) or even more general 
topologies (Seitz 2001). 

Once the topology for the cost volume or DSI has been set up, we need to compute the 
actual photoconsistency measures for each pixel and potential depth. A wide range of such 
measures have been proposed, as discussed in Section 12.3.1. Some of these are compared in 
recent surveys and evaluations of matching costs (Scharstein and Szeliski 2002; Hirschmüller
and Scharstein 2009). 


To compute an actual depth map from these costs, some form of optimization or selection 
criterion must be used. The simplest of these are sliding windows of various kinds, which are 
discussed in Section 12.4 and surveyed by Gong, Yang et al. (2007) and Tombari, Mattoccia 
et al. (2008). Global optimization frameworks are often used to compute the best dispar- 
ity field, as described in Section 12.5. These techniques include dynamic programming and 
truly global optimization algorithms, such as graph cuts and loopy belief propagation. More 
recently, deep neural networks have become popular for computing correspondence and disparity
maps, as discussed in Section 12.6 and surveyed in Laga, Jospin et al. (2020) and
Poggi, Tosi et al. (2021). A good place to find pointers to the latest results in this field is the
list of benchmarks in Table 12.1. 

Algorithms for multi-view stereo typically fall into two categories (Furukawa and Hernández 
2015). The first include algorithms that compute traditional depth maps using several images 
for computing photoconsistency measures (Okutomi and Kanade 1993; Kang, Webb et al. 
1995; Szeliski and Golland 1999; Vaish, Szeliski et al. 2006; Gallup, Frahm et al. 2008; 
Huang, Matzen et al. 2018; Yao, Luo et al. 2018). Some of these techniques compute mul- 
tiple depth maps and use additional constraints to encourage the different depth maps to be 
consistent (Szeliski 1999a; Kolmogorov and Zabih 2002; Kang and Szeliski 2004; Maitre, 
Shinagawa, and Do 2008; Zhang, Jia et al. 2008; Yan, Wei et al. 2020; Zhang, Yao et al. 
2020). 

The second category consists of papers that compute true 3D volumetric or surface-based 
object models. Again, because of the large number of papers published on this topic, rather 
than citing them here, we refer you to the material in Section 12.7.2, the surveys by Seitz, 
Curless et al. (2006), Furukawa and Hernández (2015), and Janai, Güney et al. (2020), and
the online evaluation websites listed in Table 12.1. 

The topic of monocular depth inference is currently very active. Good places to start, in 
addition to Section 12.8, are the recent surveys by Poggi, Tosi et al. (2021, Section 7) and 
Zhao, Sun et al. (2020).


12.10 Exercises 


Ex 12.1: Stereo pair rectification. Implement the following simple algorithm (Section 12.1.1):

1. Rotate both cameras so that they are looking perpendicular to the line joining the two
camera centers c0 and c1. The smallest rotation can be computed from the cross product
between the original and desired optical axes.

2. Twist the optical axes so that the horizontal axis of each camera looks in the direction
of the other camera. (Again, the cross product between the current x-axis after the first
rotation and the line joining the cameras gives the rotation.)

3. If needed, scale up the smaller (less detailed) image so that it has the same resolution
(and hence line-to-line correspondence) as the other image.

Now compare your results to the algorithm proposed by Loop and Zhang (1999). Can you
think of situations where their approach may be preferable?
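
As a starting point for step 1, the smallest rotation between two unit vectors can be built
from their cross product. The sketch below is one possible formulation, not a prescribed
solution: the helper names are hypothetical, vectors are plain lists, and the desired
optical axis is assumed to be the original axis projected onto the plane perpendicular to
the baseline.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def normalize(v):
    n = math.sqrt(dot(v, v))  # assumes v is not the zero vector
    return [x / n for x in v]


def rotation_between(a, b):
    """Smallest rotation R with R a = b for unit vectors a and b (Rodrigues form)."""
    v = cross(a, b)
    c = dot(a, b)
    if c < -1.0 + 1e-12:
        raise ValueError("vectors are antiparallel; the rotation axis is ambiguous")
    k = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    return [[(1.0 if i == j else 0.0) + k[i][j] + k2[i][j] / (1.0 + c)
             for j in range(3)] for i in range(3)]


def apply(r, v):
    return [dot(row, v) for row in r]


def desired_optical_axis(axis, baseline):
    """Project the optical axis onto the plane perpendicular to the baseline."""
    t = normalize(baseline)
    along = dot(axis, t)
    return normalize([a - along * ti for a, ti in zip(axis, t)])
```

The same rotation_between helper can be reused for the twist in step 2, with the current
x-axis and the baseline direction as its two arguments.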


Ex 12.2: Rigid direct alignment. Modify your spline-based or optical flow motion estimator
from Exercise 9.4 to use epipolar geometry, i.e., to only estimate disparity.

(Optional) Extend your algorithm to simultaneously estimate the epipolar geometry (without
first using point correspondences) by estimating a base homography corresponding to a
reference plane for the dominant motion and then an epipole for the residual parallax (motion).


Ex 12.3: Shape from profiles. Reconstruct a surface model from a series of edge images 
(Section 12.2.1). 


1. Extract edges and link them (Exercises 7.7–7.8).

2. Based on previously computed epipolar geometry, match up edges in triplets (or longer
sets) of images.

3. Reconstruct the 3D locations of the curves using osculating circles (Szeliski and Weiss
1998).

4. Render the resulting 3D surface model as a sparse mesh, i.e., drawing the reconstructed
3D profile curves and links between 3D points in neighboring images with similar
osculating circles.


Ex 12.4: Plane sweep. Implement a plane sweep algorithm (Section 12.1.2). 

If the images are already pre-rectified, this consists simply of shifting images relative to 
each other and comparing pixels. If the images are not pre-rectified, compute the homography 
that resamples the target image into the reference image’s coordinate system for each plane. 

Evaluate a subset of the following similarity measures (Section 12.3.1) and compare their
performance by visualizing the disparity space image (DSI), which should be dark for pixels
at correct depths:

• squared difference (SD);
• absolute difference (AD);
• truncated or robust measures;
• gradient differences;
• rank or census transform (the latter usually performs better);
• mutual information from a precomputed joint density function.

Consider using the Birchfield and Tomasi (1998) technique of comparing ranges between
neighboring pixels (different shifted or warped images). Also, try pre-compensating images
for bias or gain variations using one or more of the techniques discussed in Section 12.3.1.
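
For the pre-rectified case, the shifting and comparing step can be sketched on a single
scanline as follows. This is a minimal starting point rather than a full solution: the
function names are hypothetical, only the SD, AD, and truncated measures are shown, and the
convention that left pixel x matches right pixel x - d is an assumption of this example.

```python
import math


def pixel_cost(a, b, measure="SD", truncation=None):
    """Squared (SD) or absolute (AD) difference, optionally truncated."""
    if measure not in ("SD", "AD"):
        raise ValueError("measure must be 'SD' or 'AD'")
    diff = abs(a - b)
    cost = diff * diff if measure == "SD" else diff
    return min(cost, truncation) if truncation is not None else cost


def disparity_space_image(left, right, max_disparity, measure="SD", truncation=None):
    """dsi[d][x] = cost of matching left[x] with right[x - d] (inf outside the image)."""
    dsi = []
    for d in range(max_disparity + 1):
        row = []
        for x in range(len(left)):
            if x - d < 0:
                row.append(math.inf)
            else:
                row.append(pixel_cost(left[x], right[x - d], measure, truncation))
        dsi.append(row)
    return dsi


def winner_take_all(dsi):
    """Pick the lowest-cost disparity independently at every pixel."""
    n = len(dsi[0])
    return [min(range(len(dsi)), key=lambda d: dsi[d][x]) for x in range(n)]
```

Stacking the rows of dsi for every scanline gives the full DSI volume, whose slices can be
displayed as images to compare the different measures.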


Ex 12.5: Aggregation and window-based stereo. Implement one or more of the matching
cost aggregation strategies described in Section 12.4:

• convolution with a box or Gaussian kernel;
• shifting window locations by applying a min filter (Scharstein and Szeliski 2002);
• picking a window that maximizes some match-reliability metric (Veksler 2001, 2003);
• weighting pixels by their similarity to the central pixel (Yoon and Kweon 2006).

Once you have aggregated the costs in the DSI, pick the winner at each pixel (winner-take-all),
and then optionally perform one or more of the following post-processing steps:

1. compute matches both ways and pick only the reliable matches (draw the others in
another color);
2. tag matches that are unsure (whose confidence is too low);
3. fill in the matches that are unsure from neighboring values;
4. refine your matches to sub-pixel disparity by either fitting a parabola to the DSI values
around the winner or by using an iteration of Lucas–Kanade.
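
Two of the post-processing steps above, the two-way consistency check and the parabola fit,
can be sketched as follows. The function names and conventions are assumptions for this
example: disparities are non-negative integers, left pixel x maps to right pixel x - d, and
a match counts as reliable when the right-image disparity at that pixel agrees within a
tolerance.

```python
def subpixel_disparity(costs, d):
    """Refine the integer winner d with a parabola through costs[d-1], costs[d], costs[d+1]."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)  # no neighbor on one side
    c_minus, c_0, c_plus = costs[d - 1], costs[d], costs[d + 1]
    curvature = c_minus - 2.0 * c_0 + c_plus
    if curvature <= 0.0:
        return float(d)  # flat or non-convex fit: keep the integer estimate
    return d + 0.5 * (c_minus - c_plus) / curvature


def cross_check(disp_left, disp_right, tolerance=1):
    """Flag left-image matches whose right-image match maps back to them."""
    reliable = []
    for x, d in enumerate(disp_left):
        xr = x - d
        ok = 0 <= xr < len(disp_right) and abs(disp_right[xr] - d) <= tolerance
        reliable.append(ok)
    return reliable
```

Here costs is the column of DSI values at one pixel, indexed by disparity. Pixels flagged as
unreliable are natural candidates for the tagging and fill-in steps.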


Ex 12.6: Optimization-based stereo. Compute the disparity space image (DSI) volume 
using one of the techniques you implemented in Exercise 12.4 and then implement one (or 
more) of the global optimization techniques described in Section 12.5 to compute the depth 
map. Potential choices include:

• dynamic programming or scanline optimization (relatively easy);
• semi-global optimization (Hirschmüller 2008), which is a simple extension of scanline
optimization and performs well;
• graph cuts using alpha expansions (Boykov, Veksler, and Zabih 2001), for which you
will need to find a max-flow or min-cut algorithm (https://vision.middlebury.edu/stereo);
• loopy belief propagation (Freeman, Pasztor, and Carmichael 2000);
• deep neural networks, as described in Section 12.6.

Evaluate your algorithm by running it on the Middlebury stereo datasets.

How well does your algorithm do against local aggregation (Yoon and Kweon 2006)?
Can you think of some extensions or modifications to make it even better?
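
The simplest of these choices, scanline optimization, can be sketched as a dynamic program
over one scanline. The energy assumed here is the sum of per-pixel matching costs plus a
linear penalty on disparity changes between neighboring pixels; the function name and the
cost[x][d] layout are choices made for this example, and occlusion handling is left out.

```python
def scanline_optimization(cost, smoothness):
    """Minimize sum_x cost[x][d_x] + smoothness * sum_x |d_x - d_(x+1)| exactly.

    Returns (disparities, energy) for one scanline.
    """
    num_disp = len(cost[0])
    best = [list(cost[0])]   # best[x][d]: lowest energy of a path ending at (x, d)
    back = []                # back[x-1][d]: best predecessor disparity for (x, d)
    for x in range(1, len(cost)):
        row, ptr = [], []
        for d in range(num_disp):
            cands = [best[-1][p] + smoothness * abs(d - p) for p in range(num_disp)]
            p_best = min(range(num_disp), key=cands.__getitem__)
            row.append(cost[x][d] + cands[p_best])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    d = min(range(num_disp), key=best[-1].__getitem__)
    energy = best[-1][d]
    path = [d]
    for ptr in reversed(back):  # trace the optimal path from right to left
        d = ptr[d]
        path.append(d)
    path.reverse()
    return path, energy
```

With a smoothness weight of zero this reduces to winner-take-all; increasing the weight
removes isolated outliers at the price of blurring true disparity discontinuities.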


Ex 12.7: View interpolation, revisited. Compute a dense depth map using one of the tech- 
niques you developed above and use it (or, better yet, a depth map for each source image) to 
generate smooth in-between views from a stereo dataset. 

Compare your results against using the ground truth depth data (if available). 

What kinds of artifacts do you see? Can you think of ways to reduce them? 

More details on implementing such algorithms can be found in Section 14.1 and Exercises 
14.1-14.4. 


Ex 12.8: Multi-frame stereo. Extend one of your previous techniques to use multiple input 
frames (Section 12.7) and try to improve the results you obtained with just two views. 

If helpful, try using temporal selection (Kang and Szeliski 2004) to deal with the increased 
number of occlusions in multi-frame datasets. 

You can also try to simultaneously estimate multiple depth maps and make them consis- 
tent (Kolmogorov and Zabih 2002; Kang and Szeliski 2004). 

Or just use one of the latest DNN-based multi-view stereo algorithms. 

Test your algorithms out on some standard multi-view datasets. 


Ex 12.9: Volumetric stereo. Implement voxel coloring (Seitz and Dyer 1999) as a simple 
extension to the plane sweep algorithm you implemented in Exercise 12.4. 


1. Instead of computing the complete DSI all at once, evaluate each plane one at a time
from front to back.

2. Tag every voxel whose photoconsistency is below a certain threshold as being part of
the object and remember its average (or robust) color (Seitz and Dyer 1999; Eisert,
Steinbach, and Girod 2000; Kutulakos 2000; Slabaugh, Culbertson et al. 2004).

3. Erase the input pixels corresponding to tagged voxels in the input images, e.g., by
setting their alpha value to 0 (or to some reduced number, depending on occupancy).

4. As you evaluate the next plane, use the source image alpha values to modify your
photoconsistency score, e.g., only consider pixels that have full alpha or weight pixels
by their alpha values.

5. If the cameras are not all on the same side of your plane sweeps, use space carving
(Kutulakos and Seitz 2000) to cycle through different subsets of source images while
carving away the volume from different directions.
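
Steps 1 to 4 can be prototyped in a flatland setting before moving to real images. The sketch
below assumes rectified one-dimensional images, so that a voxel at column x on disparity
layer d projects to pixel x + b * d in the camera with baseline offset b. The function name,
the binary erased mask (instead of fractional alpha), and the max-minus-min consistency test
are simplifications chosen for this example, not the published algorithm in full.

```python
def voxel_coloring(images, baselines, disparities, threshold, min_views=2):
    """Front-to-back sweep over disparity layers with pixel erasure.

    `disparities` must be ordered from the nearest layer (largest disparity)
    to the farthest. Returns a dict mapping (x, d) to the mean color.
    """
    width = len(images[0])
    erased = [[False] * width for _ in images]
    model = {}
    for d in disparities:
        hits = []
        for x in range(width):
            samples = []
            for k, b in enumerate(baselines):
                u = x + b * d
                if 0 <= u < width and not erased[k][u]:
                    samples.append((k, u, images[k][u]))
            if len(samples) < min_views:
                continue  # too few unexplained pixels to judge consistency
            values = [v for _, _, v in samples]
            if max(values) - min(values) <= threshold:
                model[(x, d)] = sum(values) / len(values)
                hits.extend((k, u) for k, u, _ in samples)
        # Erase only after the whole layer is done, so that voxels within one
        # layer do not affect each other.
        for k, u in hits:
            erased[k][u] = True
    return model
```

In a scene with a two-pixel foreground object on layer 1 in front of a textured background
on layer 0, the sweep first recovers the object, erases its pixels, and then recovers those
background voxels that are still seen by at least two cameras.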


Ex 12.10: Depth map merging. Use the technique you developed for multi-frame stereo
in Exercise 12.8 or a different technique, such as the one described by Goesele, Snavely et al. 
(2007), to compute a depth map for every input image. 

Merge these depth maps into a coherent 3D model, e.g., using Poisson surface reconstruc- 
tion (Kazhdan, Bolitho, and Hoppe 2006). 


Ex 12.11: Monocular depth estimation. Test out one of the recent monocular depth inference
algorithms on your own images. Can you create a “3D photo” effect where wiggling your
camera or moving your mouse makes the photo move in 3D? Tabulate the failure cases of the
depth inference and conjecture some possible reasons and/or avenues for improvement.


Figure 13.1 3D shape acquisition and modeling techniques: (a) shaded image (Zhang,
Tsai et al. 1999) © 1999 IEEE; (b) texture gradient (Gårding 1992) © 1992 Springer; (c)
real-time depth from focus (Nayar, Watanabe, and Noguchi 1996) © 1996 IEEE; (d) scan- 
ning a scene with a stick shadow (Bouguet and Perona 1999) © 1999 Springer; (e) merging 
range maps into a 3D model (Curless and Levoy 1996) © 1996 ACM; (f) point-based surface 
modeling (Pauly, Keiser et al. 2003) © 2003 ACM; (g) automated modeling of a 3D building 
using lines and planes (Werner and Zisserman 2002) © 2002 Springer; (h) 3D face model 
from spacetime stereo (Zhang, Snavely et al. 2004) © 2004 ACM; (i) whole body, expression, 
and gesture fitting from a single image (Pavlakos, Choutas et al. 2019) © 2019 IEEE. 


13 3D reconstruction 


As we saw in the previous chapter, many stereo matching techniques have been developed
to reconstruct high-quality 3D models from two or more images. However, stereo is just 
one of the many potential cues that can be used to infer shape from images. In this chapter, 
we investigate a number of such techniques, which include not only visual cues such as 
shading and focus, but also techniques for merging multiple range or depth images into 3D 
models, as well as techniques for reconstructing specialized models, such as heads, bodies, 
or architecture. 

Among the various cues that can be used to infer shape, the shading on a surface (Fig- 
ure 13.1a) can provide a lot of information about local surface orientations and hence overall 
surface shape (Section 13.1.1). This approach becomes even more powerful when lights 
shining from different directions can be turned on and off separately (photometric stereo). 
Texture gradients (Figure 13.1b), i.e., the foreshortening of regular patterns as the surface 
slants or bends away from the camera, can provide similar cues on local surface orientation 
(Section 13.1.2). Focus is another powerful cue to scene depth, especially when two or more 
images with different focus settings are used (Section 13.1.3). 

3D shape can also be estimated using active illumination techniques such as light stripes 
(Figure 13.1d) or time of flight range finders (Section 13.2). The partial surface models 
obtained using such techniques (or passive image-based stereo) can then be merged into more 
coherent 3D surface models (Figure 13.1e), as discussed in Section 13.2.1. Such techniques 
have been used to construct highly detailed and accurate models of cultural heritage such as 
historic sites (Section 13.2.2). The resulting surface models can then be simplified to support 
viewing at different resolutions and streaming across the web (Section 13.3.2). An alternative 
to working with continuous surfaces is to represent 3D surfaces as dense collections of 3D 
oriented points (Section 13.4) or as volumetric primitives (Section 13.5). 

3D modeling can be more efficient and effective if we know something about the objects 
we are trying to reconstruct. In Section 13.6, we look at three specialized but commonly 
occurring examples, namely architecture (Figure 13.1g), heads and faces (Figure 13.1h), and 
whole bodies (Figure 13.1i). In addition to modeling people, we also discuss techniques for 
tracking them. 

The last stage of shape and appearance modeling is to extract some colored textures to 
paint onto our 3D models (Section 13.7). Some techniques go beyond this and actually esti- 
mate full BRDFs (Section 13.7.1), although if there is no desire to re-light the scene, Surface 
Light Fields may be easier to acquire (Section 14.3.2). 

Because there exists such a large variety of techniques to perform 3D modeling, this 
chapter does not go into detail on any one of these. Readers are encouraged to find more 


information in the cited references and recent computer vision conferences, as well as more 
specialized conferences devoted to these topics, e.g., the International Conference on 3D 
Vision (3DV) and the IEEE International Conference on Automatic Face and Gesture Recog- 
nition (FG). 


13.1 Shape from X 


In addition to binocular disparity, shading, texture, and focus all play a role in how we per- 
ceive shape. The study of how shape can be inferred from such cues is sometimes called 
shape from X, because the individual instances are called shape from shading, shape from 
texture, and shape from focus.¹ In this section, we look at these three cues and how they can 
be used to reconstruct 3D geometry. A good overview of all these topics can be found in the 
collection of papers on physics-based shape inference edited by Wolff, Shafer, and Healey 
(1992b), the survey by Ackermann and Goesele (2015) and the book by Ikeuchi, Matsushita 
et al. (2020). 


13.1.1 Shape from shading and photometric stereo 


When you look at images of smooth shaded objects, such as the ones shown in Figure 13.2, 
you can clearly see the shape of the object from just the shading variation. How is this 
possible? The answer is that as the surface normal changes across the object, the apparent 
brightness changes as a function of the angle between the local surface orientation and the 
incident illumination, as shown in Figure 2.15 (Section 2.2.2). 

The problem of recovering the shape of a surface from this intensity variation is known as 
shape from shading and is one of the classic problems in computer vision (Horn 1975). The 
collection of papers edited by Horn and Brooks (1989) is a great source of information on 
this topic, especially the chapter on variational approaches. The survey by Zhang, Tsai et al. 
(1999) not only reviews more recent techniques, but also provides some comparative results. 

Most shape from shading algorithms assume that the surface under consideration is of a 
uniform albedo and reflectance, and that the light source directions are either known or can 
be calibrated by the use of a reference object. Under the assumptions of distant light sources 
and observer, the variation in intensity (irradiance equation) becomes purely a function of 
the local surface orientation, 

$$I(x, y) = R(p(x, y), q(x, y)), \qquad (13.1)$$


¹We have already seen examples of shape from stereo, shape from profiles, and shape from silhouettes in 
Chapter 12. 
Figure 13.2 Synthetic shape from shading (Zhang, Tsai et al. 1999) © 1999 IEEE: shaded 
images, (a-b) with light from in front (0,0,1) and (c-d) with light from the front right 
(1,0, 1); (e-f) corresponding shape from shading reconstructions using the technique of Tsai 
and Shah (1994). 


where $(p, q) = (z_x, z_y)$ are the depth map derivatives and $R(p, q)$ is called the reflectance 
map. For example, a diffuse (Lambertian) surface has a reflectance map that is the (non-negative) 
dot product (2.89) between the surface normal $\hat{n} = (p, q, 1)/\sqrt{1 + p^2 + q^2}$ and the 
light source direction $v = (v_x, v_y, v_z)$, 

$$R(p, q) = \max\left(0, \rho \, \frac{p v_x + q v_y + v_z}{\sqrt{1 + p^2 + q^2}}\right), \qquad (13.2)$$

where $\rho$ is the surface reflectance factor (albedo). 
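As a concrete illustration of the reflectance map (13.2), the following minimal Python sketch evaluates R(p, q) for a single surface patch. The function names and the use of unit-length light vectors are assumptions made for this example; the two light directions are the ones quoted in the caption of Figure 13.2.

```python
import math

def unit(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def lambertian_reflectance(p, q, light, albedo=1.0):
    """Reflectance map R(p, q) of Equation (13.2) for a distant light source.

    (p, q) are the depth map derivatives, `light` is a unit direction
    (v_x, v_y, v_z), and `albedo` is the reflectance factor rho.
    """
    vx, vy, vz = light
    norm = math.sqrt(1.0 + p * p + q * q)
    return max(0.0, albedo * (p * vx + q * vy + vz) / norm)

frontal = unit((0.0, 0.0, 1.0))      # light from in front
front_right = unit((1.0, 0.0, 1.0))  # light from the front right

flat = lambertian_reflectance(0.0, 0.0, frontal)           # patch facing the light
tilted = lambertian_reflectance(1.0, 0.0, front_right)     # normal parallel to the light
shadowed = lambertian_reflectance(-2.0, 0.0, front_right)  # facing away, clamped to 0
```

Because a single intensity constrains only one combination of p and q, many different orientations produce the same value of R, which is why the additional constraints discussed next are needed.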


In principle, Equations (13.1-13.2) can be used to estimate (p, q) using non-linear least 
squares or some other method. Unfortunately, unless additional constraints are imposed, there 


are more unknowns per pixel $(p, q)$ than there are measurements $(I)$. One commonly used 
constraint is the smoothness constraint, 

$$\mathcal{E}_s = \int p_x^2 + p_y^2 + q_x^2 + q_y^2 \, dx \, dy = \int \|\nabla p\|^2 + \|\nabla q\|^2 \, dx \, dy, \qquad (13.3)$$

which we have already seen in Section 4.2 (4.18). The other is the integrability constraint, 

$$\mathcal{E}_i = \int (p_y - q_x)^2 \, dx \, dy, \qquad (13.4)$$

which arises naturally, because for a valid depth map $z(x, y)$ with $(p, q) = (z_x, z_y)$, we have 
$p_y = z_{xy} = z_{yx} = q_x$. 
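The smoothness and integrability constraints (13.3-13.4) can be checked numerically on a small grid. The sketch below is only an illustration: it assumes forward differences on a unit-spaced grid stored as nested lists, and the function names are hypothetical. Gradients derived from an actual depth map have zero integrability energy, while a rotational field does not.

```python
def gradient_fields(z):
    """Forward-difference estimates p = z_x and q = z_y for a grid z[y][x]."""
    h, w = len(z), len(z[0])
    p = [[z[y][x + 1] - z[y][x] for x in range(w - 1)] for y in range(h - 1)]
    q = [[z[y + 1][x] - z[y][x] for x in range(w - 1)] for y in range(h - 1)]
    return p, q

def smoothness_energy(p, q):
    """Discrete version of (13.3): sum of p_x^2 + p_y^2 + q_x^2 + q_y^2."""
    h, w = len(p), len(p[0])
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            for f in (p, q):
                f_x = f[y][x + 1] - f[y][x]
                f_y = f[y + 1][x] - f[y][x]
                total += f_x ** 2 + f_y ** 2
    return total

def integrability_energy(p, q):
    """Discrete version of (13.4): sum of (p_y - q_x)^2."""
    h, w = len(p), len(p[0])
    total = 0.0
    for y in range(h - 1):
        for x in range(w - 1):
            p_y = p[y + 1][x] - p[y][x]
            q_x = q[y][x + 1] - q[y][x]
            total += (p_y - q_x) ** 2
    return total

# Gradients of a real surface z = x^2 y are integrable.
z = [[x * x * y for x in range(5)] for y in range(5)]
p, q = gradient_fields(z)
valid_energy = integrability_energy(p, q)

# A rotational field (p, q) = (-y, x) cannot come from any depth map.
p_rot = [[-y for x in range(4)] for y in range(4)]
q_rot = [[x for x in range(4)] for y in range(4)]
rot_energy = integrability_energy(p_rot, q_rot)
rot_smoothness = smoothness_energy(p_rot, q_rot)
```

Note that the rotational field is perfectly smooth in the sense of (13.3) being small, yet it violates (13.4) everywhere, which shows that the two constraints penalize different things.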

Instead of first recovering the orientation fields (p,q) and integrating them to obtain a 
surface, it is also possible to directly minimize the discrepancy in the image formation equa- 
tion (13.1) while finding the optimal depth map z(x, y) (Horn 1990). Unfortunately, shape 
from shading is susceptible to local minima in the search space and, like other variational 
problems that involve the simultaneous estimation of many variables, can also suffer from 
slow convergence. Using multi-resolution techniques (Szeliski 1991a) can help accelerate 
the convergence, while using more sophisticated optimization techniques (Dupuis and Olien- 
sis 1994) can help avoid local minima. 

In practice, surfaces other than plaster casts are rarely of a single uniform albedo. Shape 
from shading therefore needs to be combined with some other technique or extended in some 
way to make it useful. One way to do this is to combine it with stereo matching (Fua and 
Leclerc 1995; Logothetis, Mecca, and Cipolla 2019) or known texture (surface patterns) 
(White and Forsyth 2006). The stereo and texture components provide information in textured 
regions, while shape from shading helps fill in the information across uniformly colored 
regions and also provides finer information about surface shape. 


Photometric stereo. Another way to make shape from shading more reliable is to use mul- 
tiple light sources that can be selectively turned on and off. This technique is called photo- 
metric stereo, as the light sources play a role analogous to the cameras located at different 
locations in traditional stereo (Woodham 1981).² For each light source, we have a different 
reflectance map, $R_1(p, q)$, $R_2(p, q)$, etc. Given the corresponding intensities $I_1$, $I_2$, etc. 
at a pixel, we can in principle recover both an unknown albedo $\rho$ and a surface orientation 
estimate $(p, q)$. 

For diffuse surfaces (13.2), if we parameterize the local orientation by $\hat{n}$, we get (for 
non-shadowed pixels) a set of linear equations of the form 

$$I_k = \rho \, \hat{n} \cdot v_k, \qquad (13.5)$$

from which we can recover $\rho \hat{n}$ using linear least squares. These equations are well 
conditioned as long as the (three or more) vectors $v_k$ are linearly independent, i.e., they are not 
along the same azimuth (direction away from the viewer). 
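The least-squares recovery of the scaled normal from the linear equations (13.5) can be sketched for a single pixel as follows. This is a minimal illustration, not a complete photometric stereo pipeline: it assumes known unit light directions and non-shadowed intensities, solves the 3 x 3 normal equations with Cramer's rule to stay within the Python standard library, and uses hypothetical function names and made-up test values.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def photometric_stereo_pixel(lights, intensities):
    """Least-squares albedo and unit normal from I_k = rho * (n . v_k)."""
    # Normal equations (V^T V) g = V^T I for the scaled normal g = rho * n.
    ata = [[sum(v[i] * v[j] for v in lights) for j in range(3)] for i in range(3)]
    atb = [sum(v[i] * s for v, s in zip(lights, intensities)) for i in range(3)]
    d = det3(ata)
    if abs(d) < 1e-12:
        raise ValueError("light directions are (nearly) linearly dependent")
    g = []
    for col in range(3):
        # Cramer's rule: replace one column of V^T V by V^T I.
        m = [[atb[r] if c == col else ata[r][c] for c in range(3)] for r in range(3)]
        g.append(det3(m) / d)
    albedo = math.sqrt(sum(c * c for c in g))
    normal = tuple(c / albedo for c in g)
    return albedo, normal

# Synthetic test: four unit light directions, known albedo and normal.
lights = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 0.6, 0.8), (-0.6, 0.0, 0.8)]
true_albedo, true_normal = 0.5, (0.0, 0.6, 0.8)
intensities = [true_albedo * sum(n * v for n, v in zip(true_normal, light))
               for light in lights]
albedo, normal = photometric_stereo_pixel(lights, intensities)
```

The determinant test mirrors the conditioning remark above: when the light directions are nearly linearly dependent, the normal equations become singular and no unique scaled normal can be recovered.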


² An alternative to turning lights on and off is to use three colored lights (Woodham 1994; Hernandez, Vogiatzis 
et al. 2007; Hernandez and Vogiatzis 2010). 
Figure 13.3 Multi-view photometric stereo (Logothetis, Mecca, and Cipolla 2019) © 2019 
IEEE: initial COLMAP multi-view stereo reconstruction; refined with (Park, Sinha et al. 
2017); and (Logothetis, Mecca, and Cipolla 2019). 


Once the surface normals or gradients have been recovered at each pixel, they can be 
integrated into a depth map using a variant of regularized surface fitting (4.24). Nehab, 
Rusinkiewicz et al. (2005) and Harker and O’Leary (2008) discuss more sophisticated techniques 
for doing this. The combination of multi-view stereo for coarse shape and photometric 
stereo for fine detail continues to be an active area of research (Hernandez, Vogiatzis, and 
Cipolla 2008; Wu, Liu et al. 2010; Park, Sinha et al. 2017). Logothetis, Mecca, and Cipolla 
(2019) describe such a system that can produce very high-quality scans (Figure 13.3), although 
it requires a sophisticated laboratory setup. A more practical setup that only requires 
a stereo camera and a flash to produce a flash/non-flash pair is described by Cao, Waechter et 
al. (2020). It is also possible to apply photometric stereo to outdoor web camera sequences 
(Figure 13.4), using the trajectory of the Sun as a variable direction illuminator (Ackermann, 
Langguth et al. 2012). 

When surfaces are specular, more than three light directions may be required. In fact, 
the irradiance equation given in (13.1) not only requires that the light sources and camera be 
distant from the surface, it also neglects inter-reflections, which can be a significant source 
of the shading observed on object surfaces, e.g., the darkening seen inside concave structures 
such as grooves and crevasses (Nayar, Ikeuchi, and Kanade 1991). However, if one can 
control the placements of lights and cameras so that they are reciprocal, i.e., the position 
of lights and cameras can be (conceptually) switched, it is possible to recover constraints 
on surface depths and normals using a procedure known as Helmholtz stereopsis (Zickler, 
Belhumeur, and Kriegman 2002). 

While earlier work on photometric stereo assumed known illuminant directions and reflectance 
(BRDF) functions, more recent work aims to loosen these constraints. Ackermann 
and Goesele (2015) provide an extensive survey of such techniques, while Shi, Mo et al. 
(2019) describe their DiLiGenT dataset and benchmark for evaluating non-Lambertian photometric 
stereo and cite over 100 related papers. As with other areas of computer vision, deep 
networks and end-to-end learning are now commonly used to recover shape and illuminant 
direction from photometric stereo. Some recent papers include Chen, Han et al. (2019), Li, 
Robles-Kelly et al. (2019), Haefner, Ye et al. (2019), Chen, Waechter et al. (2020), and Santo, 
Waechter, and Matsushita (2020). 

Figure 13.4 Webcam-based outdoor photometric stereo (Ackermann, Langguth et al. 2012) 
© 2012 IEEE: an input image, the recovered normal map, three basis BRDFs below their 
respective material maps, and a synthetic rendering from a new sun position. 

Figure 13.5 Synthetic shape from texture (Gárding 1992) © 1992 Springer: (a) regular 
texture wrapped onto a curved surface and (b) the corresponding surface normal estimates. 
Shape from mirror reflections (Savarese, Chen, and Perona 2005) © 2005 Springer: (c) a 
regular pattern reflecting off a curved mirror gives rise to (d) curved lines, from which 3D 
point locations and normals can be inferred. 


814 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


13.1.2 Shape from texture 


The variation in foreshortening observed in regular textures can also provide useful informa- 
tion about local surface orientation. Figure 13.5 shows an example of such a pattern, along 
with the estimated local surface orientations. Shape from texture algorithms require a num- 
ber of processing steps, including the extraction of repeated patterns or the measurement of 
local frequencies to compute local affine deformations, and a subsequent stage to infer local 
surface orientation. Details on these various stages can be found in the research literature 
(Witkin 1981; Ikeuchi 1981; Blostein and Ahuja 1987; Gárding 1992; Malik and Rosenholtz 
1997; Lobay and Forsyth 2006). A more recent paper uses a generative model to represent the 
repetitive appearance of textures and jointly optimizes the model along with the local surface 
orientations at every pixel (Verbin and Zickler 2020). 

When the original pattern is regular, it is possible to fit a regular but slightly deformed 
grid to the image and use this grid for a variety of image replacement or analysis tasks (Liu, 
Collins, and Tsin 2004; Liu, Lin, and Hays 2004; Hays, Leordeanu et al. 2006; Lin, Hays 
et al. 2006; Park, Brocklehurst et al. 2009). This process becomes even easier if specially 
printed textured cloth patterns are used (White and Forsyth 2006; White, Crane, and Forsyth 
2007). 

The deformations induced in a regular pattern when it is viewed in the reflection of a 
curved mirror, as shown in Figure 13.5c-d, can be used to recover the shape of the surface 
(Savarese, Chen, and Perona 2005; Rozenfeld, Shimshoni, and Lindenbaum 2011). It is also 
possible to infer local shape information from specular flow, i.e., the motion of specularities 
when viewed from a moving camera (Oren and Nayar 1997; Zisserman, Giblin, and Blake 
1989; Swaminathan, Kang et al. 2002). 


13.1.3 Shape from focus 


A strong cue for object depth is the amount of blur, which increases as the object’s surface 
moves away from the camera’s focusing distance. As shown in Figure 2.19, moving the object 
surface away from the focus plane increases the circle of confusion, according to a formula 
that is easy to establish using similar triangles (Exercise 2.4). 
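One way to see how the blur grows on both sides of the focus plane is to evaluate the blur circle diameter under a thin-lens model. The sketch below is an assumption-laden illustration rather than the book's own derivation: it assumes an ideal thin lens with focal length f, aperture diameter f/N, and a sensor placed so that objects at the focus distance are sharp, which gives the diameter c = (f/N) f |s - z| / (z (s - f)) for an object at distance z when the lens is focused at distance s. The function name and the example lens values are hypothetical.

```python
def blur_circle_diameter(focal_length, f_number, focus_distance, object_distance):
    """Blur circle diameter on the sensor for an ideal thin lens.

    All distances share the same unit (e.g., millimeters) and are measured
    from the lens; focus_distance must exceed focal_length.
    """
    aperture = focal_length / f_number
    return (aperture * focal_length * abs(focus_distance - object_distance)
            / (object_distance * (focus_distance - focal_length)))

# A 50 mm f/2 lens focused at 2 m (values chosen only for illustration).
in_focus = blur_circle_diameter(50.0, 2.0, 2000.0, 2000.0)
nearer = blur_circle_diameter(50.0, 2.0, 2000.0, 1000.0)
farther = blur_circle_diameter(50.0, 2.0, 2000.0, 4000.0)
```

Both the nearer and the farther object produce a non-zero blur circle, so a single blur measurement cannot tell which side of the focus plane a surface lies on.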

A number of techniques have been developed to estimate depth from the amount of de- 
focus (depth from defocus) (Pentland 1987; Nayar and Nakagawa 1994; Nayar, Watanabe, 
and Noguchi 1996; Watanabe and Nayar 1998; Chaudhuri and Rajagopalan 1999; Favaro and 


Soatto 2006). To make such a technique practical, a number of issues need to be addressed: 


• The amount of blur increases in both directions as you move away from the focus plane. 
  Therefore, it is necessary to use two or more images captured with different focus 
  distance settings (Pentland 1987; Nayar, Watanabe, and Noguchi 1996) or to translate 
  the object in depth and look for the point of maximum sharpness (Nayar and Nakagawa 
  1994). 

• The magnification of the object can vary as the focus distance is changed or the object is 
  moved. This can be modeled either explicitly (making correspondence more difficult) 
  or using telecentric optics, which approximate an orthographic camera and require an 
  aperture in front of the lens (Nayar, Watanabe, and Noguchi 1996). 

• The amount of defocus must be reliably estimated. A simple approach is to average 
  the squared gradient in a region, but this suffers from several problems, including the 
  image magnification problem mentioned above. A better solution is to use carefully 
  designed rational filters (Watanabe and Nayar 1998). 

Figure 13.6 Real-time depth from defocus (Nayar, Watanabe, and Noguchi 1996) © 1996 
IEEE: (a) the real-time focus range sensor, which includes a half-silvered mirror between 
the two telecentric lenses (lower right), a prism that splits the image into two CCD sensors 
(lower left), and an edged checkerboard pattern illuminated by a Xenon lamp (top); (b-c) 
input video frames from the two cameras along with (d) the corresponding depth map; (e-f) 
two frames (you can see the texture if you zoom in) and (g) the corresponding 3D mesh model. 


Figure 13.6 shows an example of a real-time depth from defocus sensor, which employs 
two imaging chips at slightly different depths sharing a common optical path, as well as an 
active illumination system that projects a checkerboard pattern from the same direction. As 
you can see in Figure 13.6b-g, the system produces high-accuracy real-time depth maps for 
both static and dynamic scenes. 
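The simple defocus measure mentioned in this section, the average squared gradient in a region, can be sketched in a few lines. This is only an illustration under stated assumptions: images are nested lists of floats, gradients are forward differences, and a 3 x 3 box filter stands in for optical blur. The helper names are hypothetical, and the rational filters of Watanabe and Nayar (1998) are not implemented here.

```python
def mean_squared_gradient(image, x0, y0, size):
    """Average of gx^2 + gy^2 over a size x size window starting at (x0, y0)."""
    total = 0.0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            gx = image[y][x + 1] - image[y][x]
            gy = image[y + 1][x] - image[y][x]
            total += gx * gx + gy * gy
    return total / (size * size)

def box_blur(image):
    """3x3 box filter with replicated borders, used here as a stand-in for defocus."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += image[yy][xx]
            out[y][x] = acc / 9.0
    return out

# A checkerboard with 2x2-pixel squares, similar in spirit to a projected pattern.
sharp = [[float((x // 2 + y // 2) % 2) for x in range(12)] for y in range(12)]
blurred = box_blur(sharp)

sharp_score = mean_squared_gradient(sharp, 2, 2, 8)
blurred_score = mean_squared_gradient(blurred, 2, 2, 8)
```

The sharper image yields the larger score, so comparing scores across images taken at different focus settings indicates which setting is closer to being in focus for that region.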
Figure 13.7 Range data scanning (Curless and Levoy 1996) © 1996 ACM: (a) a laser dot 
on a surface is imaged by a CCD sensor; (b) a laser stripe (sheet) is imaged by the sensor 
(the deformation of the stripe encodes the distance to the object); (c) the resulting set of 3D 
points are turned into (d) a triangulated mesh. 


13.2 3D scanning 


As we have seen in the previous section, actively lighting a scene, whether for the purpose of 
estimating normals using photometric stereo, or for adding artificial texture for shape from 
defocus, can greatly improve the performance of vision systems. This kind of active illu- 
mination has been used from the earliest days of machine vision to construct highly precise 
sensors for estimating 3D depth images using a variety of rangefinding (or range sensing) 
techniques (Besl 1989; Curless 1999; Hebert 2000; Zhang 2018).³ While rangefinders such 
as lidar (Light Detection and Ranging) and laser-based 3D scanners were once limited to 
commercial and laboratory applications, the development of low-cost depth cameras such as 
the Microsoft Kinect (Zhang 2012) has revolutionized many aspects of computer vision. It 
is now common to refer to the registered color and depth frames produced by such cameras 
as RGB-D (or RGBD) images (Silberman, Hoiem et al. 2012). 

One of the early active illumination sensors used in computer vision and computer graph- 
ics was a laser or light stripe sensor, which sweeps a plane of light across the scene or object 
while observing it from an offset viewpoint, as shown in Figure 13.7b (Rioux and Bird 1993; 
Curless and Levoy 1995). As the stripe falls across the object, it deforms its shape according 
to the shape of the surface it is illuminating. It is then a simple matter of using optical 
triangulation to estimate the 3D locations of all the points seen in a particular stripe. In more 
detail, knowledge of the 3D plane equation of the light stripe allows us to infer the 3D lo- 
cation corresponding to each illuminated pixel, as previously discussed in (2.70-2.71). The 


accuracy of light striping techniques can be improved by finding the exact temporal peak in 


³Rangefinding is an old-fashioned word for measuring distance, often using passive or active optical means. 
Figure 13.8 Shape scanning using cast shadows (Bouguet and Perona 1999) © 1999 
Springer: (a) camera setup with a point light source (a desk lamp without its reflector), a 
hand-held stick casting a shadow, and (b) the objects being scanned in front of two planar 
backgrounds. (c) Real-time depth map using a pulsed illumination system (Iddan and Yahav 
2001) © 2001 SPIE. 


illumination for each pixel (Curless and Levoy 1995). The final accuracy of a scanner can 
be determined using slant edge modulation techniques, i.e., by imaging sharp creases in a 
calibration object (Goesele, Fuchs, and Seidel 2003). 
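The optical triangulation step for a light stripe amounts to intersecting each illuminated pixel's viewing ray with the known plane of light. The sketch below assumes a pinhole camera at the origin looking along +Z, with a focal length in pixels and a principal point (cx, cy), and a light plane written as aX + bY + cZ + d = 0 in camera coordinates. These conventions, the function names, and the numeric values are illustrative assumptions rather than the book's notation.

```python
def triangulate_stripe_pixel(u, v, focal, cx, cy, plane):
    """Intersect the viewing ray through pixel (u, v) with the light plane."""
    a, b, c, d = plane
    ray = ((u - cx) / focal, (v - cy) / focal, 1.0)  # ray direction with Z = 1
    denom = a * ray[0] + b * ray[1] + c * ray[2]
    if abs(denom) < 1e-12:
        raise ValueError("viewing ray is parallel to the light plane")
    t = -d / denom                                   # point = t * ray lies on the plane
    return tuple(t * r for r in ray)

def project(point, focal, cx, cy):
    """Pinhole projection used to generate a synthetic observation."""
    x, y, z = point
    return (cx + focal * x / z, cy + focal * y / z)

# Light plane X + 0.5 Z - 1 = 0 and a surface point lying on it.
plane = (1.0, 0.0, 0.5, -1.0)
surface_point = (0.25, 0.1, 1.5)
u, v = project(surface_point, 500.0, 320.0, 240.0)
recovered = triangulate_stripe_pixel(u, v, 500.0, 320.0, 240.0, plane)
```

Rays that are nearly parallel to the light plane give an ill-conditioned intersection, which is one reason the stripe is observed from an offset viewpoint.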

An interesting variant on light stripe rangefinding is presented by Bouguet and Perona 
(1999). Instead of projecting a light stripe, they simply wave a stick casting a shadow over a 
scene or object illuminated by a point light source such as a lamp or the Sun (Figure 13.8a). 
As the shadow falls across two background planes whose orientation relative to the camera is 
known (or inferred during pre-calibration), the plane equation for each stripe can be inferred 
from the two projected lines, whose 3D equations are known (Figure 13.8b). The deformation 
of the shadow as it crosses the object being scanned then reveals its 3D shape, as with regular 
light stripe rangefinding (Exercise 13.2). This technique can also be used to estimate the 3D 
geometry of a background scene and how its appearance varies as it moves into shadow, to 
cast new shadows onto the scene (Chuang, Goldman et al. 2003) (Section 10.4.3). 

The time it takes to scan an object using a light stripe technique is proportional to the 
number of depth planes used, which is usually comparable to the number of pixels across 
an image. A much faster scanner can be constructed by turning different projector pixels on 
and off in a structured manner, e.g., using a binary or Gray code (Besl 1989). For example, 
let us assume that the LCD projector we are using has 1,024 columns of pixels. Taking the 
10-bit binary code corresponding to each column’s address (0...1,023), we project the first 
bit, then the second, etc. After 10 projections (e.g., a third of a second for a synchronized 
30Hz camera-projector system), each pixel in the camera knows which of the 1,024 columns 


of projector light it is seeing. A similar approach can also be used to estimate the refractive 
Figure 13.9 The Microsoft Kinect depth camera (Zhang 2012) © 2012 IEEE: (a) the hardware, 
comprising an infrared (IR) speckle pattern projector and a color and IR camera pair; 
(b) close-up of a sample infrared image, showing the projected dots; (c) the final depth map, 


which has black “shadows” in the areas not illuminated by the projector. 


properties of an object by placing a monitor behind the object (Zongker, Werner et al. 1999; 
Chuang, Zongker et al. 2000) (Section 14.4). Very fast scanners can also be constructed 
with a single laser beam, i.e., a real-time flying spot optical triangulation scanner (Rioux, 
Bechthold et al. 1987). 
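The coded-pattern idea for the 1,024-column projector described above can be sketched with a Gray code, in which adjacent columns differ in exactly one bit. The code below is a minimal illustration that assumes ideal, already-thresholded camera observations (each pixel sees a clean 0 or 1 per frame); the function names are hypothetical.

```python
def gray_encode(n):
    """Binary-reflected Gray code of the integer n."""
    return n ^ (n >> 1)

def gray_decode(g):
    """Inverse of gray_encode."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def column_patterns(num_columns=1024, num_bits=10):
    """patterns[b][col] is 1 if projector column col is lit in frame b (MSB first)."""
    return [[(gray_encode(col) >> (num_bits - 1 - b)) & 1
             for col in range(num_columns)]
            for b in range(num_bits)]

def decode_column(bits):
    """Recover the projector column from the bits one camera pixel saw, MSB first."""
    g = 0
    for bit in bits:
        g = (g << 1) | bit
    return gray_decode(g)

patterns = column_patterns()
# Bits observed by a camera pixel that is lit by projector column 613.
observed = [patterns[b][613] for b in range(10)]
decoded = decode_column(observed)
```

Using a Gray code rather than a plain binary code means that a pixel straddling the boundary between two columns can be wrong by at most one column, since only one bit changes across that boundary.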

If even faster, i.e., frame-rate, scanning is required, we can project a single textured pat- 
tern into the scene. Proesmans, Van Gool, and Defoort (1998) describe a system where a 
checkerboard grid is projected onto an object and the deformation of the grid is used to infer 
3D shape. Unfortunately, such a technique only works if the surface is continuous enough 
to link all of the grid points. Instead of projecting a grid, it is also possible to project one or 
more sinusoidal fringe patterns and to then recover deformations in the surface from the rela- 
tive phase displacements using a process called fringe projection profilometry (Su and Zhang 
2010; Zuo, Huang et al. 2016; Zhang 2018). 
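To make the phase recovery step of fringe projection concrete, the sketch below uses one common N-step phase-shifting formulation, which is an assumption here since the text does not specify a particular scheme. Each pixel is assumed to observe I_k = A + B cos(φ + 2πk/N) for N ≥ 3 equally spaced shifts, from which the wrapped phase φ follows in closed form. The function name and test values are hypothetical.

```python
import math

def wrapped_phase(intensities):
    """Wrapped phase in (-pi, pi] from N >= 3 equally shifted fringe intensities.

    Assumes I_k = A + B * cos(phi + 2*pi*k/N); the offset A and the
    modulation B cancel out of the ratio.
    """
    n = len(intensities)
    if n < 3:
        raise ValueError("need at least three phase shifts")
    s = sum(i * math.sin(2.0 * math.pi * k / n) for k, i in enumerate(intensities))
    c = sum(i * math.cos(2.0 * math.pi * k / n) for k, i in enumerate(intensities))
    return math.atan2(-s, c)

def simulate_pixel(phi, offset, modulation, n):
    """Synthetic observations for one pixel under the assumed fringe model."""
    return [offset + modulation * math.cos(phi + 2.0 * math.pi * k / n)
            for k in range(n)]

recovered_three = wrapped_phase(simulate_pixel(1.0, 0.5, 0.4, 3))
recovered_four = wrapped_phase(simulate_pixel(-2.5, 0.6, 0.3, 4))
```

The recovered phase is only known modulo 2π, so a separate unwrapping step (spatial or multi-frequency) is still needed before the phase displacement can be converted into depth.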

The Microsoft Kinect (Zhang 2012) depth camera uses a variant of this technique, pro- 
jecting an infrared (IR) speckle pattern, which looks like a bunch of random dots, but which 
in fact consists of a known calibrated pseudo-random pattern (Figure 13.9). By measuring the 
horizontal displacement (parallax) between the dots seen in the IR camera and their expected 
locations, a depth map can be computed, interpolating over the pixels not illuminated by the 
dots (Fanello, Rhemann et al. 2016; Fanello, Valentin et al. 2017b). Since its release, the 
Kinect camera has been widely used in computer vision research (Zhang 2012; Han, Shao et 
al. 2013), as well as applications such as 3D body tracking (Section 13.6.4) and object scan- 
ning and home interior reconstruction (Section 13.2.1). Kinect sensors were used to create 
the first widely used dataset for 3D semantic scene understanding (Silberman, Hoiem et al. 
2012), although larger 3D scanned datasets have since been created (Dai, Chang et al. 2017). 
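The conversion from dot displacement to depth can be illustrated with a simple reference-plane triangulation model. This is an assumed textbook-style model, not a description of the actual Kinect firmware: the pattern is taken to be calibrated at a known reference depth Z0, and a dot that shifts by d pixels satisfies 1/Z = 1/Z0 + d/(f b) for focal length f (in pixels) and projector-camera baseline b. The sign convention, function names, and numeric values are all illustrative.

```python
def depth_from_dot_shift(shift_px, focal_px, baseline_m, ref_depth_m):
    """Depth from the horizontal shift of a projected dot relative to its
    expected location at the reference depth (positive shift = closer)."""
    inv_depth = 1.0 / ref_depth_m + shift_px / (focal_px * baseline_m)
    if inv_depth <= 0.0:
        raise ValueError("shift is inconsistent with a point in front of the camera")
    return 1.0 / inv_depth

def dot_shift_for_depth(depth_m, focal_px, baseline_m, ref_depth_m):
    """Forward model: the shift a dot would show for a surface at depth_m."""
    return focal_px * baseline_m * (1.0 / depth_m - 1.0 / ref_depth_m)

# Illustrative numbers only; these are not published sensor specifications.
FOCAL_PX, BASELINE_M, REF_DEPTH_M = 580.0, 0.075, 2.0
shift = dot_shift_for_depth(1.0, FOCAL_PX, BASELINE_M, REF_DEPTH_M)
depth = depth_from_dot_shift(shift, FOCAL_PX, BASELINE_M, REF_DEPTH_M)
```

Because depth is inversely related to the shift, a fixed error in the measured displacement produces a depth error that grows roughly quadratically with distance.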

A higher resolution system can be constructed using high-speed custom illumination and 
sensing hardware. Iddan and Yahav (2001) describe the construction of their 3DV Zcam 
Figure 13.10 Real-time dense 3D face capture using spacetime stereo (Zhang, Snavely 
et al. 2004) © 2004 ACM: (a) set of five consecutive video frames from one of two stereo 
cameras (every fifth frame is free of stripe patterns, in order to extract texture); (b) resulting 
high-quality 3D surface model (depth map visualized as a shaded rendering). 


video-rate depth sensing camera, which projects a pulsed plane of light onto the scene and 
then integrates the returning light for a short interval, essentially obtaining time-of-flight 
measurement for the distance to individual pixels in the scene. A good description of ear- 
lier time-of-flight systems, including amplitude and frequency modulation schemes for lidar, 
can be found in (Besl 1989), and a more recent description can be found in the book by 
Hansard, Lee et al. (2012). While the initial version of the Microsoft Kinect depth cam- 
era used a speckle pattern structured light system (Zhang 2012), the newer Kinect V2 uses 
a time-of-flight (ToF) sensor that uses phase measurements of amplitude-modulated light 
signals (Bamji, O'Connor et al. 2014). Traditional multi-frequency phase unwrapping tech- 
niques can be used to estimate absolute depth, but more accurate depths for dynamic scenes 
can be obtained by simultaneously modeling depths and object velocities (Stühmer, Nowozin 
et al. 2015). 
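The relationship between measured phase and depth in an amplitude-modulated time-of-flight sensor, and the basic idea of multi-frequency unwrapping, can be sketched as follows. The model is a simplified assumption: a phase φ measured at modulation frequency f corresponds to depth c φ / (4π f), which is ambiguous modulo c / (2f), and two frequencies are reconciled by a brute-force search over wrap counts. The frequencies, function names, and search strategy are illustrative and not taken from any particular sensor.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def phase_for_depth(depth, freq):
    """Wrapped phase that an ideal sensor would measure for a given depth."""
    return (4.0 * math.pi * freq * depth / C) % (2.0 * math.pi)

def unwrap_two_frequencies(phi1, f1, phi2, f2, max_depth):
    """Pick the wrap counts whose depth hypotheses agree best, up to max_depth."""
    r1, r2 = C / (2.0 * f1), C / (2.0 * f2)  # unambiguous range per frequency
    best = None
    for k1 in range(int(max_depth / r1) + 1):
        d1 = (phi1 / (2.0 * math.pi) + k1) * r1
        for k2 in range(int(max_depth / r2) + 1):
            d2 = (phi2 / (2.0 * math.pi) + k2) * r2
            err = abs(d1 - d2)
            if best is None or err < best[0]:
                best = (err, 0.5 * (d1 + d2))
    return best[1]

F1, F2 = 80e6, 100e6   # illustrative modulation frequencies
TRUE_DEPTH = 5.0       # beyond the ~1.9 m and ~1.5 m single-frequency ranges
estimate = unwrap_two_frequencies(phase_for_depth(TRUE_DEPTH, F1), F1,
                                  phase_for_depth(TRUE_DEPTH, F2), F2,
                                  max_depth=7.0)
```

This static-scene search ignores noise and motion; when the scene moves between the phase captures, the hypotheses no longer agree exactly, which is the situation the joint depth and velocity models cited above are designed to handle.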


Instead of using a single camera, it is also possible to construct an active illumination 
range sensor using stereo imaging setups, resulting in a system that is often called active (illu- 
mination) stereo. The simplest way to do this is to just project random stripe patterns onto the 
scene to create synthetic texture, which helps match textureless surfaces (Kang, Webb et al. 
1995). Projecting a known series of stripes, just as in coded pattern single-camera rangefind- 
ing, makes the correspondence between pixels unambiguous and allows for the recovery of 
depth estimates at pixels only seen in a single camera (Scharstein and Szeliski 2003). This 
technique has been used to produce large numbers of highly accurate registered multi-image 
stereo pairs and depth maps for the purpose of evaluating stereo correspondence algorithms 
(Scharstein and Szeliski 2002; Hirschmüller and Scharstein 2009; Scharstein, Hirschmüller 
et al. 2014) and learning depth map priors and parameters (Pal, Weinman et al. 2012). Care- 


fully designed algorithms can perform local matching of patterns at 500Hz (Fanello, Valentin 
et al. 2017a,b). 

While projecting multiple patterns usually requires the scene or object to remain still, 
additional processing can enable the production of real-time depth maps for dynamic scenes. 
The basic idea (Davis, Ramamoorthi, and Rusinkiewicz 2003; Zhang, Curless, and Seitz 
2003) is to assume that depth is nearly constant within a 3D space-time window around each 
pixel and to use the 3D window for matching and reconstruction. Depending on the surface 
shape and motion, this assumption may be error-prone, as shown in Davis, Nehab et al. 
(2005). To model shapes more accurately, Zhang, Curless, and Seitz (2003) model the linear 
disparity variation within the space-time window and show that better results can be obtained 
by globally optimizing disparity and disparity gradient estimates over video volumes (Zhang, 
Snavely et al. 2004). Figure 13.10 shows the results of applying this system to a person’s 
face; the frame-rate 3D surface model can then be used for further model-based fitting and 
computer graphics manipulation (Section 13.6.2). As mentioned previously, motion modeling 
can also be applied to phase-based time-of-flight sensors (Stühmer, Nowozin et al. 2015). 

One word of caution about active range sensing. When the surfaces being scanned are too 
reflective, the camera may see a reflection off the object’s surface and assume that this virtual 
image is the true scene. For surfaces with moderate amounts of reflection, such as the ceramic 
models in Wood, Azuma et al. (2000) or the Corn Cho puffs in Park, Newcombe, and Seitz 
(2018), there is still sufficient diffuse reflection under the specular layer to obtain a 3D range 
map. (The specular part can then be recovered separately to produce a surface light field, as 
described in Section 14.3.2.) However, for true mirrors, active range scanners will invariably 
capture the virtual 3D model seen reflected in the mirror, so that additional techniques such 


as looking for a reflection of the scanning device must be used (Whelan, Goesele et al. 2018). 


13.2.1 Range data merging 


While individual range images can be useful for applications such as real-time z-keying or fa- 
cial motion capture, they are often used as building blocks for more complete 3D object mod- 
eling. In such applications, the next two steps in processing are the registration (alignment) of 
partial 3D surface models and their integration into coherent 3D surfaces (Curless 1999). If 
desired, this can be followed by a model fitting stage using either parametric representations 
such as generalized cylinders (Agin and Binford 1976; Nevatia and Binford 1977; Marr and 
Nishihara 1978; Brooks 1981), superquadrics (Pentland 1986; Solina and Bajcsy 1990; Ter- 
zopoulos and Metaxas 1991), or non-parametric models such as triangular meshes (Boissonat 
1984) or physically based models (Terzopoulos, Witkin, and Kass 1988; Delingette, Hebert, 
and Ikeuchi 1992; Terzopoulos and Metaxas 1991; McInerney and Terzopoulos 1993; Terzopoulos 
1999). A number of techniques have also been developed for segmenting range 
images into simpler constituent surfaces (Hoover, Jean-Baptiste et al. 1996). 

The most widely used 3D registration technique is the iterative closest point (ICP) algo- 
rithm, which alternates between finding the closest point matches between the two surfaces 
being aligned and then solving a 3D absolute orientation problem (Section 8.1.5, (8.31-8.32)) 
(Besl and McKay 1992; Chen and Medioni 1992; Zhang 1994; Szeliski and Lavallée 
1996; Gold, Rangarajan et al. 1998; David, DeMenthon et al. 2004; Li and Hartley 2007; 
Engqvist, Josephson, and Kahl 2009). Some techniques, such as the one developed by Chen 
and Medioni (1992), use local surface tangent planes to make this computation more accurate 
and to accelerate convergence. More recently, Rusinkiewicz (2019) proposed a symmetric 
oriented point distance similar to the energy terms used in oriented particles (Szeliski and 
Tonnesen 1992). A nice review of ICP and its related variants can be found in the papers by 
Tam, Cheng et al. (2012) and Pomerleau, Colas, and Siegwart (2015). 
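The alternation at the heart of ICP can be shown with a deliberately small 2D toy version. The sketch below is an illustration under simplifying assumptions, not any of the cited algorithms: it uses brute-force closest-point search, a closed-form 2D rigid fit in place of the 3D absolute orientation solution, no outlier rejection, and a fixed iteration count. All names and test values are hypothetical.

```python
import math

def closest_point(p, points):
    """Brute-force nearest neighbor of p in a list of 2D points."""
    return min(points, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src[i] onto dst[i]."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    cos_term = sum((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy)
                   for p, q in zip(src, dst))
    sin_term = sum((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx)
                   for p, q in zip(src, dst))
    theta = math.atan2(sin_term, cos_term)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def icp_2d(src, dst, iterations=20):
    """Alternate closest-point matching and rigid fitting; return the moved src."""
    current = list(src)
    for _ in range(iterations):
        matches = [closest_point(p, dst) for p in current]
        theta, tx, ty = fit_rigid_2d(current, matches)
        c, s = math.cos(theta), math.sin(theta)
        current = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in current]
    return current

# An L-shaped model and a copy displaced by a small rotation and translation.
model = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
c0, s0 = math.cos(-0.1), math.sin(-0.1)
scan = [(c0 * x - s0 * y + 0.1, s0 * x + c0 * y - 0.05) for x, y in model]
aligned = icp_2d(scan, model)
max_error = max(math.hypot(a[0] - m[0], a[1] - m[1]) for a, m in zip(aligned, model))
```

The example converges because the initial displacement is small relative to the point spacing, so the first set of closest-point matches is already correct; with a larger initial misalignment the same loop can settle into a wrong local minimum.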

As the two surfaces being aligned usually only have partial overlap and may also have 
outliers, robust matching criteria (Section 8.1.4 and Appendix B.3) are typically used. To 
speed up the determination of the closest point, and also to make the distance-to-surface 
computation more accurate, one of the two point sets (e.g., the current merged model) can 
be converted into a signed distance function, optionally represented using an octree spline 
for compactness (Lavallée and Szeliski 1995). Variants on the basic ICP algorithm can be 
used to register 3D point sets under non-rigid deformations, e.g., for medical applications 
(Feldmar and Ayache 1996; Szeliski and Lavallée 1996). Color values associated with the 
points or range measurements can also be used as part of the registration process to improve 
robustness (Johnson and Kang 1997; Pulli 1999). 

Unfortunately, the ICP algorithm and its variants can only find a locally optimal alignment 
between 3D surfaces. If an approximate initial alignment is not known a priori, more global correspondence or search 
techniques, based on local descriptors invariant to 3D rigid transformations, need to be used. 
An example of such a descriptor is the spin image, which is a local circular projection of a 
3D surface patch around the local normal axis (Johnson and Hebert 1999). Another (earlier) 
example is the splash representation introduced by Stein and Medioni (1992). More recent 
work along these lines studies the problem of pose estimation (Section 11.2) from RGB-D 
images, which is essentially the same problem as aligning a range map to a 3D model. Recent 
papers on this topic (Drost, Ulrich et al. 2010; Brachmann, Michel et al. 2016; Vidal, Lin et
al. 2018) are typically evaluated on the Benchmark for 6DOF Object Pose Estimation,⁴
which also hosts a series of yearly workshops.

4 https://bop.felk.cvut.cz/home

Figure 13.11 Range data merging (Curless and Levoy 1996) © 1996 ACM: (a) two signed
distance functions (top left) are merged with their weights (bottom left) to produce a combined
set of functions (right column) from which an isosurface can be extracted (green dashed line);
(b) the signed distance functions are combined with empty and unseen space labels to fill holes
in the isosurface.

Once two or more 3D surfaces have been aligned, they can be merged into a single model.
One approach is to represent each surface using a triangulated mesh and combine these
meshes using a process that is sometimes called zippering (Soucy and Laurendeau 1992;
Turk and Levoy 1994). Another, now more widely used, approach is to compute a (truncated) 
signed distance function that fits all of the 3D data points (Hoppe, DeRose et al. 1992; Curless 
and Levoy 1996; Hilton, Stoddart et al. 1996; Wheeler, Sato, and Ikeuchi 1998). 

Figure 13.11 shows one such approach, the volumetric range image processing (VRIP) 
technique developed by Curless and Levoy (1996), which first computes a weighted signed 
distance function from each range image and then merges them using a weighted averaging 
process. To make the representation more compact, run-length coding is used to encode 
the empty, seen, and varying (signed distance) voxels, and only the signed distance values 
near each surface are stored.⁵ Once the merged signed distance function has been computed,
a zero-crossing surface extraction algorithm, such as marching cubes (Lorensen and Cline 
1987), can be used to recover a meshed surface model. Figure 13.12 shows an example of 
the complete range data merging and isosurface extraction pipeline. Rusinkiewicz, Hall-Holt, 
and Levoy (2002) present a real-time system that combines fast ICP and point-based merging 
and rendering. 
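
The weighted averaging step can be sketched in one dimension, along a single line of voxels. This is only a minimal illustration under stated assumptions: the function names are hypothetical, the distances are not truncated, the weights are constant per scan, and the zero crossing is found by linear interpolation instead of marching cubes.

```python
import numpy as np

def merge_signed_distances(distances, weights):
    """Weighted average of per-scan signed distances at each voxel.

    distances, weights: arrays of shape (num_scans, num_voxels).
    Voxels with zero total weight are returned as NaN (unseen).
    """
    distances = np.asarray(distances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total_w = weights.sum(axis=0)
    merged = np.full(total_w.shape, np.nan)
    seen = total_w > 0
    merged[seen] = (weights * distances).sum(axis=0)[seen] / total_w[seen]
    return merged, total_w

def zero_crossing(x, d):
    """Linearly interpolated location of the first sign change of d along x."""
    for i in range(len(d) - 1):
        if np.isfinite(d[i]) and np.isfinite(d[i + 1]) and d[i] * d[i + 1] < 0:
            return x[i] + (x[i + 1] - x[i]) * d[i] / (d[i] - d[i + 1])
    return None

x = np.linspace(0.0, 1.0, 11)                  # voxel centers along one line
d = np.stack([0.42 - x, 0.48 - x])             # two scans disagree slightly on the surface depth
w = np.stack([np.full_like(x, 1.0), np.full_like(x, 3.0)])
merged, total_w = merge_signed_distances(d, w)
surface = zero_crossing(x, merged)             # lies between the two estimates, closer to the second
```

The merged surface is pulled toward the scan with the larger weight, which is how more reliable measurements come to dominate the final model.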

5 An alternative, even more compact, representation could be to use octrees (Lavallée and Szeliski 1995).

Figure 13.12 Reconstruction and hardcopy of the “Happy Buddha” statuette (Curless and
Levoy 1996) © 1996 ACM: (a) photograph of the original statue after spray painting with
matte gray; (b) partial range scan; (c) merged range scans; (d) colored rendering of the
reconstructed model; (e) hardcopy of the model constructed using stereolithography.

Figure 13.13 Fusing multiple depth images using the KinectFusion real-time system (Newcombe,
Izadi et al. 2011) © 2011 IEEE. The three images show an original (noisy) range scan,
rendered as a colored normal map, and the fused 3D model, rendered as both a normal map
and Phong-shaded.

The advent of consumer-level RGB-D cameras such as Kinect created renewed interest
in large-scale range data registration and merging (Zhang 2012; Han, Shao et al. 2013). An
influential paper in this area is KinectFusion (Izadi, Kim et al. 2011; Newcombe, Izadi et
al. 2011), which combines an ICP-like SLAM technique called DTAM (Newcombe, Lovegrove,
and Davison 2011) with real-time TSDF (truncated signed distance function) volumetric
integration, which is described in more detail in Section 13.5.1. Follow-on papers
include Elastic Fragments for non-rigid alignment (Zhou, Miller, and Koltun 2013), Octomap
(Hornung, Wurm et al. 2013), which uses an octree and probabilistic occupancy, and
Voxel Hashing (Nießner, Zollhöfer et al. 2013) and Chisel (Klingensmith, Dryanovski et
al. 2015), both of which use spatial hashing to compress the TSDF. KinectFusion has also
been extended to handle highly variable scanning resolution (Fuhrmann and Goesele 2011,
2014), dynamic scenes (DynamicFusion (Newcombe, Fox, and Seitz 2015), VolumeDeform
(Innmann, Zollhöfer et al. 2016), and Motion2Fusion (Dou, Davidson et al. 2017)), to use
non-rigid surface deformations for global model refinement (ElasticFusion: Whelan,
Salas-Moreno et al. (2016)), to produce a globally consistent BundleFusion model (Dai, Nießner
et al. 2017), and to use a deep network to perform the non-rigid matching (Božič, Zollhöfer
et al. 2020). More details on these and other techniques for constructing 3D models from
RGB-D scans can be found in the survey by Zollhöfer, Stotko et al. (2018).

Some of the most recent work in range data merging uses neural networks to represent the 
TSDF (Park, Florence et al. 2019), update a TSDF with incoming range data scans (Weder, 
Schönberger et al. 2020, 2021), or provide local priors (Chabra, Lenssen et al. 2020). Range
data merging techniques are often used both for 3D object scanning and for visual map building
and navigation (RGB-D SLAM), which we discussed in Section 11.5. Now that
depth sensing (aka lidar) technology is starting to appear in mobile phones, it can also be used
to build complete texture-mapped 3D room models, e.g., using Occipital’s Canvas app (Stein
2020).⁶

Volumetric range data merging techniques based on signed distance or characteristic 
(inside–outside) functions are also widely used to extract smooth well-behaved surfaces from
oriented or unoriented point sets (Hoppe, DeRose et al. 1992; Ohtake, Belyaev et al. 2003; 
Kazhdan, Bolitho, and Hoppe 2006; Lempitsky and Boykov 2007; Zach, Pock, and Bischof 
2007b; Zach 2008), as discussed in more detail in Section 13.5.1 and the survey paper by 
Berger, Tagliasacchi et al. (2017).

6 https://canvas.io


13.2.2 Application: Digital heritage


Active rangefinding technologies, combined with surface modeling and appearance modeling
techniques (Section 13.7), are widely used in the fields of archaeological and historical
preservation, which often also goes under the name digital heritage (MacDonald 2006). In
such applications, detailed 3D models of cultural objects are acquired and later used for
applications such as analysis, preservation, restoration, and the production of duplicate artwork
(Rioux and Bird 1993).

Figure 13.14 Laser range modeling of the Bayon temple at Angkor-Thom (Banno, Masuda
et al. 2008) © 2008 Springer: (a) sample photograph from the site; (b) a detailed head model
scanned from the ground; (c) final merged 3D model of the temple scanned using a laser
range sensor mounted on a balloon.

A notable example of such an endeavor is the Digital Michelangelo project of Levoy, 
Pulli et al. (2000), which used Cyberware laser stripe scanners and high-quality digital SLR 
cameras mounted on a large gantry to obtain detailed scans of Michelangelo’s David and other 
sculptures in Florence. The project also took scans of the Forma Urbis Romae, an ancient 
stone map of Rome that had shattered into pieces, for which new matches were obtained 
using digital techniques. The whole process, from initial planning, to software development, 
acquisition, and post-processing, took several years (and many volunteers), and produced a 
wealth of 3D shape and appearance modeling techniques as a result. 

Even larger-scale projects have since been attempted, for example, the scanning of com- 
plete temple sites such as Angkor-Thom (Ikeuchi and Sato 2001; Ikeuchi and Miyazaki 2007; 
Banno, Masuda et al. 2008). Figure 13.14 shows details from this project, including a sample 
photograph, a detailed 3D (sculptural) head model scanned from ground level, and an aerial
overview of the final merged 3D site model, which was acquired using a balloon.


13.3 Surface representations 


In previous sections, we have seen different representations being used to integrate 3D range 
scans. We now look at several of these representations in more detail. Explicit surface
representations, such as triangle meshes, splines (Farin 1992, 2002), and subdivision surfaces
(Stollnitz, DeRose, and Salesin 1996; Zorin, Schröder, and Sweldens 1996; Warren and
Weimer 2001; Peters and Reif 2008), enable not only the creation of highly detailed models
but also processing operations, such as interpolation (Section 13.3.1), fairing or smoothing,
and decimation and simplification (Section 13.3.2). We also examine discrete point-based
representations (Section 13.4) and volumetric representations (Section 13.5).


13.3.1 Surface interpolation 


One of the most common operations on surfaces is their reconstruction from a set of sparse 
data constraints, i.e., scattered data interpolation, which we covered in Section 4.1. When 
formulating such problems, surfaces may be parameterized as height fields f(x), as 3D para- 
metric surfaces f(x), or as non-parametric models such as collections of triangles. 

In Section 4.2, we saw how two-dimensional function interpolation and approximation 
problems {d_i} → f(x) could be cast as energy minimization problems using regularization
(4.18-4.23). Such problems can also specify the locations of discontinuities in the surface as 
well as local orientation constraints (Terzopoulos 1986b; Zhang, Dugas-Phocion et al. 2002). 

One approach to solving such problems is to discretize both the surface and the energy 
on a discrete grid or mesh using finite element analysis (4.24-4.27) (Terzopoulos 1986b). 
Such problems can then be solved using sparse system solving techniques, such as multigrid 
(Briggs, Henson, and McCormick 2000) or hierarchically preconditioned conjugate gradient 
(Szeliski 2006b; Krishnan and Szeliski 2011; Krishnan, Fattal, and Szeliski 2013). The sur- 
face can also be represented using a hierarchical combination of multilevel B-splines (Lee, 
Wolberg, and Shin 1997). 
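
To make the discretization concrete, here is a minimal one-dimensional sketch in which the unknown is a vector of grid values and the energy is a sum of squared data errors plus squared first differences. The first-order (membrane-like) smoothness term, the single weight `lam`, and the dense solver are assumptions chosen for brevity; realistic problems are 2D, often use higher-order terms, and call for the sparse solvers cited above.

```python
import numpy as np

def regularized_interpolation(n, samples, lam=1.0):
    """Minimize sum_k (f[i_k] - d_k)^2 + lam * sum_i (f[i+1] - f[i])^2 over f in R^n.

    samples: list of (grid index, data value) pairs.
    """
    A = np.zeros((n, n))
    b = np.zeros(n)
    for i, d in samples:               # data terms
        A[i, i] += 1.0
        b[i] += d
    for i in range(n - 1):             # first-order smoothness terms
        A[i, i] += lam
        A[i + 1, i + 1] += lam
        A[i, i + 1] -= lam
        A[i + 1, i] -= lam
    return np.linalg.solve(A, b)       # normal equations of the quadratic energy

f = regularized_interpolation(5, [(0, 0.0), (4, 4.0)], lam=1.0)
```

With data only at the two ends, the solution is a straight line whose slope is slightly flatter than the data would suggest, which shows that this formulation approximates the data instead of interpolating it exactly.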

An alternative approach is to use radial basis (or kernel) functions (Boult and Kender 
1986; Nielson 1993), which we covered in Section 4.1.1. As we mentioned in that section, 
if we want the function f(x) to exactly interpolate the data points, a dense linear system must 
be solved to determine the magnitude associated with each basis function (Boult and Kender 
1986). It turns out that, for certain regularized problems, e.g., (4.18-4.21), there exist radial 
basis functions (kernels) that give the same results as a full analytical solution (Boult and 
Kender 1986). Unfortunately, because the dense system solving is cubic in the number of 
data points, basis function approaches can only be used for small problems such as feature- 
based image morphing (Beier and Neely 1992). 
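
The dense system mentioned above can be written down in a few lines. The Gaussian kernel, its width `sigma`, and the function names are assumptions made for this sketch; other kernels lead to the same structure.

```python
import numpy as np

def fit_rbf(points, values, sigma=1.0):
    """Solve the dense N x N system Phi w = d for the kernel weights w."""
    diff = points[:, None, :] - points[None, :, :]
    Phi = np.exp(-(diff ** 2).sum(axis=2) / (2 * sigma ** 2))
    return np.linalg.solve(Phi, values)        # cost grows cubically with N

def eval_rbf(points, weights, query, sigma=1.0):
    """Evaluate the weighted sum of kernels at the query locations."""
    diff = query[:, None, :] - points[None, :, :]
    return np.exp(-(diff ** 2).sum(axis=2) / (2 * sigma ** 2)) @ weights

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
vals = np.array([0.0, 1.0, 1.0, 2.0, 3.0])
w = fit_rbf(pts, vals, sigma=0.5)
recovered = eval_rbf(pts, w, pts, sigma=0.5)   # reproduces vals at the data points
```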

When a three-dimensional parametric surface is being modeled, the vector-valued function
f in (4.18–4.27) encodes 3D coordinates (x, y, z) on the surface and the domain x =
(s, t) encodes the surface parameterization. One example of such surfaces is the family of
symmetry-seeking parametric models, which are elastically deformable versions of generalized
cylinders⁷ (Terzopoulos, Witkin, and Kass 1987). In these models, s is the parameter along the
spine of the deformable tube and t is the parameter around the tube. A variety of smoothness
and radial symmetry forces are used to constrain the model while it is fitted to image-based
silhouette curves.

It is also possible to define non-parametric surface models, such as general triangulated 
meshes, and to equip such meshes (using finite element analysis) with both internal smooth- 
ness metrics and external data fitting metrics (Sander and Zucker 1990; Fua and Sander 1992; 
Delingette, Hebert, and Ikeuchi 1992; McInerney and Terzopoulos 1993). While most of
these approaches assume a standard elastic deformation model, which uses quadratic internal
smoothness terms, it is also possible to use sub-linear energy models to better preserve
surface creases (Diebel, Thrun, and Brünig 2006) or to use graph-convolutional neural networks
(GCNNs) as an alternative to the update equations, as in Deep Active Surface Models (Wick- 
ramasinghe, Fua, and Knott 2021). Triangle meshes can also be augmented with either spline 
elements (Sullivan and Ponce 1998) or subdivision surfaces (Stollnitz, DeRose, and Salesin 
1996; Zorin, Schröder, and Sweldens 1996; Warren and Weimer 2001; Peters and Reif 2008) 
to produce surfaces with better smoothness control. 

Both parametric and non-parametric surface models assume that the topology of the sur- 
face is known and fixed ahead of time. For more flexible surface modeling, we can either rep- 
resent the surface as a collection of oriented points (Section 13.4) or use 3D implicit functions 
(Section 13.5.1), which can also be combined with elastic 3D surface models (McInerney and 
Terzopoulos 1993). 

The field of surface reconstruction from unorganized point samples continues to advance 
rapidly, with more recent work addressing issues with data imperfections, as described in the 
survey by Berger, Tagliasacchi et al. (2017).

7 A generalized cylinder (Brooks 1981) is a solid of revolution, i.e., the result of rotating a (usually smooth) curve
around an axis. It can also be generated by sweeping a slowly varying circular cross-section along the axis. (These
two interpretations are equivalent.)


13.3.2 Surface simplification


Figure 13.15 Progressive mesh representation of an airplane model (Hoppe 1996) © 1996
ACM: (a) base mesh M^0 (150 faces); (b) mesh M^175 (500 faces); (c) mesh M^425 (1,000
faces); (d) original mesh M = M^n (13,546 faces).

Figure 13.16 Geometry images (Gu, Gortler, and Hoppe 2002) © 2002 ACM: (a) the 257
× 257 geometry image defines a mesh over the surface; (b) the 512 × 512 normal map defines
vertex normals; (c) final lit 3D model.

Once a triangle mesh has been created from 3D data, it is often desirable to create a hierarchy
of mesh models, for example, to control the displayed level of detail (LOD) in a computer
graphics application. (In essence, this is a 3D analog to image pyramids (Section 3.5).) One
approach to doing this is to approximate a given mesh with one that has subdivision connectivity,
over which a set of triangular wavelet coefficients can then be computed (Eck, DeRose
et al. 1995). A more continuous approach is to use sequential edge collapse operations to
go from the original fine-resolution mesh to a coarse base-level mesh (Hoppe 1996; Lee,
Sweldens et al. 1998). The resulting progressive mesh (PM) representation can be used to
render the 3D model at arbitrary levels of detail, as shown in Figure 13.15. More recent
papers on multiresolution geometric modeling can be found in the survey by Floater and
Hormann (2005) and the collection of papers edited by Dodgson, Floater, and Sabin (2005).
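
A single edge collapse on an indexed triangle list can be sketched as follows. This is a hypothetical helper covering connectivity only: it does not choose which edge to collapse, does not reposition the surviving vertex, and does not record the inverse (vertex split) information that a progressive mesh needs to restore detail.

```python
def collapse_edge(faces, keep, remove):
    """Collapse the edge (keep, remove) by redirecting `remove` to `keep`.

    faces: list of (i, j, k) vertex-index triples.
    Faces that contained both endpoints become degenerate and are dropped.
    """
    new_faces = []
    for face in faces:
        face = tuple(keep if v == remove else v for v in face)
        if len(set(face)) == 3:
            new_faces.append(face)
    return new_faces

# A 3 x 3 grid of vertices (numbered row by row) split into 8 triangles.
faces = [(0, 1, 4), (0, 4, 3), (1, 2, 5), (1, 5, 4),
         (3, 4, 7), (3, 7, 6), (4, 5, 8), (4, 8, 7)]
coarse = collapse_edge(faces, keep=4, remove=5)
```

Collapsing an interior edge removes one vertex and the two faces that shared that edge, so repeated collapses trace out a sequence of progressively coarser meshes.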


13.3.3 Geometry images 


While multi-resolution surface representations such as Eck, DeRose et al. (1995), Hoppe 
(1996), and Lee, Sweldens et al. (1998) support level of detail operations, they still consist of 
an irregular collection of triangles, which makes them more difficult to compress and store in
a cache-efficient manner.⁸

8 Subdivision triangulations, such as those in Eck, DeRose et al. (1995), are semi-regular, i.e., regular (ordered
and nested) within each subdivided base triangle.

To make the triangulation completely regular (uniform and gridded), Gu, Gortler, and
Hoppe (2002) describe how to create geometry images by cutting surface meshes along well-chosen
lines and “flattening” the resulting representation into a square. Figure 13.16a shows
the resulting (x, y, z) values of the surface mesh mapped over the unit square, while Figure 13.16b
shows the associated (n_x, n_y, n_z) normal map, i.e., the surface normals associated
with each mesh vertex, which can be used to compensate for loss in visual fidelity if the
original geometry image is heavily compressed.
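
Because the samples lie on a regular grid, the mesh connectivity is implicit and can be regenerated from the image dimensions alone. The sketch below assumes the geometry image is stored as an H × W × 3 array and splits every grid cell into two triangles; the function name and the choice of diagonal are illustrative assumptions, and the cutting and flattening steps themselves are not shown.

```python
import numpy as np

def geometry_image_to_mesh(geom):
    """Turn an (H, W, 3) array of (x, y, z) samples into vertices and triangles."""
    h, w, _ = geom.shape
    vertices = geom.reshape(-1, 3)
    idx = np.arange(h * w).reshape(h, w)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()    # top-left, top-right
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()      # bottom-left, bottom-right
    # Two triangles per grid cell, with the same winding order.
    faces = np.concatenate([np.stack([a, b, c], axis=1),
                            np.stack([b, d, c], axis=1)])
    return vertices, faces

vertices, faces = geometry_image_to_mesh(np.zeros((3, 4, 3)))
```

No face list needs to be stored or transmitted, which is one reason such a representation is easier to compress than an irregular mesh.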


13.4 Point-based representations 


As we mentioned previously, triangle-based surface models assume that the topology (and 
often the rough shape) of the 3D model is known ahead of time. While it is possible to 
re-mesh a model as it is being deformed or fitted, a simpler solution is to dispense with an 
explicit triangle mesh altogether and to have triangle vertices behave as oriented points, or 
particles, or surface elements (surfels) (Szeliski and Tonnesen 1992). 

To endow the resulting particle system with internal smoothness constraints, pairwise in- 
teraction potentials can be defined that approximate the equivalent elastic bending energies 
that would be obtained using local finite-element analysis.⁹ Instead of defining the finite
element neighborhood for each particle (vertex) ahead of time, a soft influence function is 
used to couple nearby particles. The resulting 3D model can change both topology and par- 
ticle density as it evolves and can therefore be used to interpolate partial 3D data with holes 
(Szeliski, Tonnesen, and Terzopoulos 1993b). Discontinuities in both the surface orientation 
and crease curves can also be modeled (Szeliski, Tonnesen, and Terzopoulos 1993a). 

9 As mentioned before, an alternative is to use sub-linear interaction potentials, which encourage the preservation
of surface creases (Diebel, Thrun, and Brünig 2006).

Figure 13.17 Point-based surface modeling with moving least squares (MLS) (Pauly,
Keiser et al. 2003) © 2003 ACM: (a) a set of points (black dots) is turned into an implicit
inside–outside function (black curve); (b) the signed distance to the nearest oriented point
can serve as an approximation to the inside–outside distance; (c) a set of oriented points
with variable sampling density representing a 3D surface (head model); (d) local estimate of
sampling density, which is used in the moving least squares; (e) reconstructed continuous 3D
surface.

To render the particle system as a continuous surface, local dynamic triangulation heuristics
(Szeliski and Tonnesen 1992) or direct surface element splatting (Pfister, Zwicker et al.
2000) can be used. Another alternative is to first convert the point cloud into an implicit signed
distance or inside–outside function, either by using minimum signed distances to the oriented
points (Hoppe, DeRose et al. 1992) or by interpolating a characteristic (inside–outside) function
using radial basis functions (Turk and O’Brien 2002; Dinh, Turk, and Slabaugh 2002).
Even greater precision over the implicit function fitting, including the ability to handle irregular
point densities, can be obtained by computing a moving least squares (MLS) estimate of
the signed distance function (Alexa, Behr et al. 2003; Pauly, Keiser et al. 2003), as shown
in Figure 13.17. Further improvements can be obtained using local sphere fitting (Guennebaud
and Gross 2007), faster and more accurate re-sampling (Guennebaud, Germann, and
Gross 2008), and kernel regression to better tolerate outliers (Öztireli, Guennebaud, and Gross
2008).
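
The simplest of these conversions, the signed distance to the tangent plane of the nearest oriented point, can be sketched as follows. The brute-force nearest-neighbor search and the convention that normals point outward (so that the value is negative inside) are assumptions of this sketch; it is not the moving least squares estimator.

```python
import numpy as np

def signed_distance(query, points, normals):
    """Signed distance from each query point to the tangent plane of its nearest oriented point."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return ((query - points[nearest]) * normals[nearest]).sum(axis=1)

# Oriented points sampled on a unit circle; the outward normal equals the position.
angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
values = signed_distance(np.array([[0.5, 0.0], [2.0, 0.0]]), circle, circle)
```

The resulting function is piecewise linear and discontinuous where the nearest point switches, which is one motivation for the smoother estimators discussed above.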

The survey by Berger, Tagliasacchi et al. (2017) discusses more recent work on re- 
constructing smooth complete surfaces from point clouds. The SurfelMeshing paper by 
Schöps, Sattler, and Pollefeys (2020) presents an RGB-D SLAM system based on a variable-resolution
surfel representation that gets re-triangulated as more scans are integrated. Other
recent approaches to 3D point clouds that use deep learning, mentioned previously in Sec- 
tion 5.5.1, are discussed in the survey by Guo, Wang et al. (2020). Even more recent algo- 
rithms to estimate better normals in 3D models are presented in Ben-Shabat and Gould (2020) 
and Zhu and Smith (2020). 


13.5 Volumetric representations 


A third alternative for modeling 3D surfaces is to construct 3D volumetric inside–outside
functions. We have already seen examples of this in Section 12.7.2, where we looked at
voxel coloring (Seitz and Dyer 1999), space carving (Kutulakos and Seitz 2000), and level
set (Pons, Keriven, and Faugeras 2007) techniques for stereo matching, and Section 12.7.3,
where we discussed using binary silhouette images to reconstruct volumes.

In this section, we look at continuous implicit (inside–outside) functions to represent 3D
shape.


13.5.1 Implicit surfaces and level sets


While polyhedral and voxel-based representations can represent three-dimensional shapes to 
an arbitrary precision, they lack some of the intrinsic smoothness properties available with 
continuous implicit surfaces, which use an indicator function (or characteristic function)
F(x, y, z) to indicate which 3D points are inside (F(x, y, z) < 0) or outside (F(x, y, z) > 0)
the object.

Superquadrics were an early example of using implicit functions to model 3D objects in
computer vision (Pentland 1986; Solina and Bajcsy 1990; Waithe and Ferrie 1991; Leonardis,
Jaklič, and Solina 1997). To model a wider variety of shapes, superquadrics are usually com- 
bined with either rigid or non-rigid deformations (Terzopoulos and Metaxas 1991; Metaxas 
and Terzopoulos 2002). Superquadric models can either be fitted to range data or used di- 
rectly for stereo matching. 
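
For reference, the inside–outside function of an undeformed, axis-aligned superquadric (superellipsoid) can be evaluated as in the following sketch. The parameter names `a`, `e1`, and `e2` (axis lengths and two shape exponents) and the subtraction of 1, which makes the function negative inside and positive outside, are conventions assumed here.

```python
import numpy as np

def superquadric(p, a=(1.0, 1.0, 1.0), e1=1.0, e2=1.0):
    """Inside-outside function of a superellipsoid: negative inside, zero on the surface."""
    x, y, z = (np.abs(np.asarray(p, dtype=float)) / np.asarray(a)).T
    xy = (x ** (2.0 / e2) + y ** (2.0 / e2)) ** (e2 / e1)
    return xy + z ** (2.0 / e1) - 1.0

corner = [0.9, 0.9, 0.9]
outside_sphere = superquadric(corner)                  # e1 = e2 = 1 gives a unit sphere
inside_box = superquadric(corner, e1=0.1, e2=0.1)      # small exponents give a box-like shape
```

Varying just the two exponents moves the shape between box-like, cylindrical, spherical, and pinched forms, which is why so few parameters can cover a useful range of objects.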

A different kind of implicit shape model can be constructed by defining a signed distance 
function over a regular three-dimensional grid, optionally using an octree spline to represent 
this function more coarsely away from its surface (zero-set) (Lavallée and Szeliski 1995; 
Szeliski and Lavallée 1996; Frisken, Perry et al. 2000; Ohtake, Belyaev et al. 2003). We 
have already seen examples of signed distance functions being used to represent distance 
transforms (Section 3.3.3), level sets for 2D contour fitting and tracking (Section 7.3.2), volu- 
metric stereo (Section 12.7.2), range data merging (Section 13.2.1), and point-based modeling 
(Section 13.4). The advantage of representing such functions directly on a grid is that it is 
quick and easy to look up distance function values for any (x, y, z) location and also easy to 
extract the isosurface using the marching cubes algorithm (Lorensen and Cline 1987). The 
work of Ohtake, Belyaev et al. (2003) is particularly notable, as it allows for several distance 
functions to be used simultaneously and then combined locally to produce sharp features such 
as creases. 
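
Looking up a distance value at an arbitrary location amounts to interpolating between the eight surrounding grid samples. The sketch below assumes a uniform grid with a given origin and spacing and uses trilinear interpolation; the function name and argument layout are hypothetical, and queries outside the grid are simply extrapolated from the nearest cell.

```python
import numpy as np

def sample_sdf(grid, origin, spacing, p):
    """Trilinearly interpolate a signed distance grid at a continuous point p = (x, y, z)."""
    g = (np.asarray(p, dtype=float) - origin) / spacing          # continuous grid coordinates
    i0 = np.clip(np.floor(g).astype(int), 0, np.array(grid.shape) - 2)
    t = g - i0                                                    # fractional offsets within the cell
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0]) *
                     (t[1] if dy else 1 - t[1]) *
                     (t[2] if dz else 1 - t[2]))
                value += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return value

xs = np.arange(5) * 0.25
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
plane = X + 2 * Y - Z - 0.3          # a linear function, which trilinear lookup reproduces exactly
value = sample_sdf(plane, np.zeros(3), 0.25, (0.33, 0.71, 0.12))
```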

Poisson surface reconstruction (Kazhdan, Bolitho, and Hoppe 2006; Kazhdan and Hoppe 
2013) uses a closely related volumetric function, namely a smoothed 0/1 inside–outside (characteristic
or occupancy) function, which can be thought of as a clipped signed distance function.
The gradients for this function are set to lie along oriented surface normals near known
surface points and 0 elsewhere. The function itself is represented using a quadratic tensor-product
B-spline over an octree, which provides a compact representation with larger cells
away from the surface or in regions of lower point density, and also admits the efficient solution
of the related Poisson equations (4.24–4.27), e.g., Section 8.4.4 and Pérez, Gangnet, and
Blake (2003). 

It is also possible to replace the quadratic penalties used in the Poisson equations with
L_1 (total variation) constraints and still obtain a convex optimization problem, which can be
solved using either continuous (Zach, Pock, and Bischof 2007b; Zach 2008) or discrete graph
cut (Lempitsky and Boykov 2007) techniques.

Figure 13.18 A Pixel-aligned Implicit Function (PIFu) network can recover a high-resolution
3D textured model of a clothed human from a single input image (Saito, Huang
et al. 2019) © 2019 IEEE.

Signed distance functions also play an integral role in level-set evolution equations (Sec- 
tions 7.3.2 and 12.7.2), where the values of distance transforms on the mesh are updated as 
the surface evolves to fit multi-view stereo photoconsistency measures (Faugeras and Keriven 
1998). 

As with many other areas of computer vision, deep neural networks have started be- 
ing applied to the construction and modeling of volumetric object representations. Some 
neural networks construct 3D surface or volumetric occupancy grid models from single im- 
ages (Choy, Xu et al. 2016; Tatarchenko, Dosovitskiy, and Brox 2017; Groueix, Fisher et 
al. 2018; Richter and Roth 2018), although more recent experiments suggest that these net- 
works may just be recognizing the general object category and doing a small amount of 
fitting (Tatarchenko, Richter et al. 2019). DeepSDFs (Park, Florence et al. 2019), IM-NET 
(Chen and Zhang 2019), Occupancy Networks (Mescheder, Oechsle et al. 2019), Deep Implicit
Surface (DISN) networks (Xu, Wang et al. 2019), and UCLID-Net (Guillard, Remelli,
and Fua 2020) train networks to transform continuous (x, y, z) inputs into signed distance 
or [0, 1] occupancy values and sometimes combine convolutional image encoders with MLPs 
to represent color and surface details (Oechsle, Mescheder et al. 2019), while MeshSDF can 
continuously transform SDFs into deformable meshes (Remelli, Lukoianov et al. 2020). All 
of these networks use latent codes to represent individual instances from a generic class (e.g., 
car or chair) from the ShapeNet dataset (Chang, Funkhouser et al. 2015), although they use
the codes in different parts of the network (either in the input or through conditional batch
normalization). This allows them to reconstruct 3D models from just a single image. 

Pixel-aligned Implicit Function (PIFu) networks combine fully convolutional image features
with neural implicit functions to better preserve local shape and color details (Saito,
Huang et al. 2019; Saito, Simon et al. 2020). They are trained specifically on clothed humans
and can hallucinate full 3D models from just a single color image (Figure 13.18). Neural
Radiance Fields (NeRF) extend this idea by taking pixel ray directions as additional inputs
and by outputting continuous-valued opacities and radiance values, enabling ray-traced
rendering of shiny 3D models constructed from multiple input images (Mildenhall, Srinivasan
et al. 2020). This representation is related to Lumigraphs and Surface Light Fields, which
we study in Section 14.3. Both of these systems are examples of neural rendering approaches
to generating photorealistic novel views, which we discuss in more detail in Section 14.6.
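
To show how per-sample opacities and radiance values along a ray turn into a pixel color, the sketch below applies standard front-to-back compositing to a handful of samples. The network that would produce the densities and colors is not shown, and the sample values and the function name are made up for illustration.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: (N,) volume densities; colors: (N, 3) radiance values;
    deltas: (N,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-densities * deltas)                    # opacity of each segment
    # Fraction of light that survives all segments in front of each sample.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    return weights @ colors, weights

densities = np.array([0.0, 1000.0, 5.0])       # empty space, an opaque surface, something behind it
colors = np.eye(3)                             # red, green, blue
pixel, weights = composite_ray(densities, colors, np.full(3, 0.1))
```

The sample behind the opaque surface receives zero weight, so the pixel takes the color of the first surface hit.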

To deal with larger (e.g., building-scale) scenes, Convolutional Occupancy Networks 
(Peng, Niemeyer et al. 2020) first retrieve local features from a 2D, multiplane, or 3D grid, 
and then use a trained MLP (fully connected network) to decode these into local occupancy 
volumes. Instead of modeling a complete 3D scene, Local Implicit Grid Representations 
(Jiang, Sud et al. 2020) model small local sub-volumes, allowing them to be used as a kind
of prior for other shape reconstruction methods.


13.6 Model-based reconstruction 


When we know something ahead of time about the objects we are trying to model, we can 
construct more detailed and reliable 3D models using specialized techniques and representa- 
tions. For example, architecture is usually made up of large planar regions and other para- 
metric forms (such as surfaces of revolution), usually oriented perpendicular to gravity and to 
each other (Section 13.6.1). Heads and faces can be represented using low-dimensional, non- 
rigid shape models, because the variability in shape and appearance of human faces, while 
extremely large, is still bounded (Section 13.6.2). Human bodies or parts, such as hands, form 
highly articulated structures, which can be represented using kinematic chains of piecewise 
rigid skeletal elements linked by joints (Section 13.6.4). 

In this section, we highlight some of the main ideas, representations, and modeling algo- 
rithms used for these three cases. Additional details and references can be found in special- 
ized conferences and workshops devoted to these topics, e.g., the International Conference 
on 3D Vision (3DV) and the IEEE International Conference on Automatic Face and Gesture 
Recognition (FG). 


13.6.1 Architecture 


Architectural modeling, especially from aerial photography, has been one of the longest studied
problems in both photogrammetry and computer vision (Walker and Herman 1988). In the
last two decades, the development of reliable image-based modeling techniques, as well as
the prevalence of digital cameras and 3D computer games, has led to widespread deployment
of such systems.

Figure 13.19 Interactive architectural modeling using the Façade system (Debevec, Taylor,
and Malik 1996) © 1996 ACM: (a) input image with user-drawn edges shown in green; (b)
shaded 3D solid model; (c) geometric primitives overlaid onto the input image; (d) final
view-dependent, texture-mapped 3D model.


The work by Debevec, Taylor, and Malik (1996) was one of the earliest hybrid geometry- and
image-based modeling and rendering systems. Their Façade system combines an interactive
image-guided geometric modeling tool with model-based (local plane plus parallax)
stereo matching and view-dependent texture mapping. During the interactive photogrammet- 
ric modeling phase, the user selects block elements and aligns their edges with visible edges 
in the input images (Figure 13.19a). The system then automatically computes the dimensions 
and locations of the blocks along with the camera positions using constrained optimization 
(Figure 13.19b-c). This approach is intrinsically more reliable than general feature-based 
structure from motion, because it exploits the strong geometry available in the block primi- 
tives. Related work by Becker and Bove (1995), Horry, Anjyo, and Arai (1997), Criminisi, 
Reid, and Zisserman (2000), and Holynski, Geraghty et al. (2020) exploits similar informa- 
tion available from vanishing points. In the interactive, image-based modeling system of 
Sinha, Steedly et al. (2008), vanishing point directions are used to guide the user drawing of 
polygons, which are then automatically fitted to sparse 3D points recovered using structure 
from motion. 
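
As a small illustration of the vanishing point computations that such systems rely on, the sketch below estimates the common intersection of several image line segments using homogeneous coordinates. It is a generic least-squares construction, not the procedure of any of the cited papers, and it assumes the segments have already been grouped by direction and that the vanishing point is finite.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(segments):
    """Least-squares intersection of image line segments ((x1, y1), (x2, y2))."""
    lines = np.array([line_through(p, q) for p, q in segments])
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)   # unit normals
    _, _, Vt = np.linalg.svd(lines)
    v = Vt[-1]                     # right singular vector with the smallest singular value
    return v[:2] / v[2]            # assumes a finite vanishing point (v[2] != 0)

segments = [((0, 0), (50, 25)), ((0, 100), (50, 75)), ((0, 50), (10, 50))]
vp = vanishing_point(segments)     # all three lines meet at (100, 50)
```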


Once the rough geometry has been estimated, more detailed offset maps can be computed
for each planar face using a local plane sweep, which Debevec, Taylor, and Malik (1996) call
model-based stereo. Finally, during rendering, images from different viewpoints are warped
and blended together as the camera moves around the scene, using a process (related to light
field and Lumigraph rendering; see Section 14.3) called view-dependent texture mapping
(Figure 13.19d).

Figure 13.20 Interactive 3D modeling from panoramas (Shum, Han, and Szeliski 1998)
© 1998 IEEE: (a) wide-angle view of a panorama with user-drawn vertical and horizontal
(axis-aligned) lines; (b) single-view reconstruction of the corridors.


For interior modeling, instead of working with single pictures, it is more useful to work 
with panoramas, as you can see larger extents of walls and other structures. The 3D modeling 
system developed by Shum, Han, and Szeliski (1998) first constructs calibrated panoramas 
from multiple images (Section 11.4.2) and then has the user draw vertical and horizontal 
lines in the image to demarcate the boundaries of planar regions. The lines are initially used 
to establish an absolute rotation for each panorama and are later used (along with the inferred 
vertices and planes) to optimize the 3D structure, which can be recovered up to scale from one 
or more images (Figure 13.20). Recent advances in deep networks now make it possible to 
both automatically infer the lines and their junctions (Huang, Wang et al. 2018; Zhang, Li et 
al. 2019) and to build complete 3D wireframe models (Zhou, Qi, and Ma 2019; Zhou, Qi et al.
2019b). 360° high dynamic range panoramas can also be used for outdoor modeling, because 
they provide highly reliable estimates of relative camera orientations as well as vanishing 
point directions (Antone and Teller 2002; Teller, Antone et al. 2003). 
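
To make the rotation estimation step concrete, here is a minimal sketch under stated assumptions. Each user-drawn vertical line is given as a pair of unit viewing rays in the panorama's coordinate frame. The two rays span a plane through the camera center whose normal must be perpendicular to the 3D vertical direction, so the vertical direction is the (approximate) null vector of the stacked normals. The function names and the choice of the y axis as "up" are hypothetical. Horizontal lines would be needed to fix the remaining rotation about the vertical axis.

```python
import numpy as np

def direction_from_lines(ray_pairs, hint):
    """Estimate the common 3D direction of lines seen as pairs of viewing rays.

    ray_pairs: iterable of (r1, r2), the rays through a line's two endpoints.
    hint:      rough guess used only to resolve the sign ambiguity.
    """
    normals = []
    for r1, r2 in ray_pairs:
        n = np.cross(r1, r2)            # normal of the line's interpretation plane
        normals.append(n / np.linalg.norm(n))
    # The direction v satisfies n_i . v = 0 for all i: take the right singular
    # vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(normals))
    v = vt[-1]
    return v if v @ hint >= 0 else -v

def rotation_aligning(v, target):
    """Rotation matrix R such that R @ v is parallel to target (Rodrigues formula)."""
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    target = np.asarray(target, dtype=float) / np.linalg.norm(target)
    axis = np.cross(v, target)
    s, c = np.linalg.norm(axis), float(v @ target)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)
        # Antiparallel case: rotate by 180 degrees about an axis perpendicular to v.
        perp = np.cross(v, [1.0, 0.0, 0.0])
        if np.linalg.norm(perp) < 1e-6:
            perp = np.cross(v, [0.0, 1.0, 0.0])
        perp = perp / np.linalg.norm(perp)
        return 2.0 * np.outer(perp, perp) - np.eye(3)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + K + (K @ K) * ((1.0 - c) / s**2)
```

Given the estimated vertical direction `v`, `rotation_aligning(v, [0, 1, 0])` yields a rotation that levels the panorama. At least two lines with distinct interpretation planes are required for the null vector to be unique.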


While earlier image-based modeling systems required some user authoring, Werner and 
Zisserman (2002) present a fully automated line-based reconstruction system. As described 
in Section 11.4.8, they first detect lines and vanishing points and use them to calibrate the 
camera; then they establish line correspondences using both appearance matching and trifocal 
tensors, which enables them to reconstruct families of 3D line segments. They then generate 
plane hypotheses, using both co-planar 3D lines and a plane sweep (Section 12.1.2) based 
on cross-correlation scores evaluated at interest points. Intersections of planes are used to 
determine the extent of each plane, i.e., an initial coarse geometry, which is then refined with 
the addition of rectangular or wedge-shaped indentations and extrusions. Note that when 
top-down maps of the buildings being modeled are available, these can be used to further 
Figure 13.21 Automated architectural reconstruction using 3D lines and planes (Sinha,
Steedly, and Szeliski 2009) © 2009 IEEE.


constrain the 3D modeling process (Robertson and Cipolla 2002, 2009). The idea of using 
matched 3D lines for estimating vanishing point directions and dominant planes is used in 
a number of fully automated image-based architectural modeling systems (Zebedin, Bauer 
et al. 2008; Mičušík and Košecká 2009; Furukawa, Curless et al. 2009b; Sinha, Steedly,
and Szeliski 2009; Holynski, Geraghty et al. 2020) as well as SLAM systems (Zhou, Zou 
et al. 2015; Li, Yao et al. 2018; Yang and Scherer 2019). Figure 13.21 shows some of the 
processing stages in the system developed by Sinha, Steedly, and Szeliski (2009). 

Another common characteristic of architecture is the repeated use of primitives such as 
windows, doors, and colonnades. Architectural modeling systems can be designed to search 
for such repeated elements and to use them as part of the structure inference process (Dick, 
Torr, and Cipolla 2004; Mueller, Zeng et al. 2007; Schindler, Krishnamurthy et al. 2008; 
Pauly, Mitra et al. 2008; Sinha, Steedly et al. 2008). The combination of structured elements 
such as parallel lines, junctions, and rectangles with full axis-aligned 3D models for the 
modeling of architectural environments has recently been called holistic 3D reconstruction. 
More details can be found in the recent tutorial by Zhou, Furukawa, and Ma (2019), workshop 
(Zhou, Furukawa et al. 2020), and state-of-the-art report by Pintore, Mura et al. (2020). 

The combination of all these techniques now makes it possible to reconstruct the struc- 
ture of large 3D scenes (Zhu and Kanade 2008). For example, the Urbanscan system of 
Pollefeys, Nistér et al. (2008) reconstructs texture-mapped 3D models of city streets from 
videos acquired with a GPS-equipped vehicle. To obtain real-time performance, they use 
both optimized online structure-from-motion algorithms, as well as GPU implementations of 
plane-sweep stereo aligned to dominant planes and depth map fusion. Cornelis, Leibe et al. 
(2008) present a related system that also uses plane-sweep stereo (aligned to vertical building
façades) combined with object recognition and segmentation for vehicles. Mičušík and
Košecká (2009) build on these results using omni-directional images and superpixel-based


stereo matching along dominant plane orientations. Reconstruction directly from active range 
Figure 13.22 3D model fitting to a collection of images (Pighin, Hecker et al. 1998) ©
1998 ACM: (a) set of five input images along with user-selected keypoints; (b) the complete
set of keypoints and curves; (c) three meshes: the original, adapted after 13 keypoints, and
after an additional 99 keypoints; (d) the partition of the image into separately animatable
regions.


scanning data combined with color imagery that has been compensated for exposure and 
lighting variations is also possible (Chen and Chen 2008; Stamos, Liu et al. 2008; Troccoli 
and Allen 2008). 


Numerous photogrammetric reconstruction systems that produce detailed texture-mapped 
3D models have been developed based on these computer vision techniques.¹⁰ Examples
of commercial software that can be used to reconstruct large-scale 3D models from aerial
drone and ground level photography include Pix4D,¹¹ Metashape,¹² and RealityCapture.¹³
Another example is Occipital’s Canvas mobile phone app¹⁴ (Stein 2020), which appears to
use a combination of photogrammetry (3D point and line matching and reconstruction, as 
discussed above) and depth map fusion. 
13.6.2 Facial modeling and tracking


Another area in which specialized shape and appearance models are extremely helpful is 
in the modeling of heads and faces. Even though the appearance of people seems at first 
glance to be infinitely variable, the actual shape of a person’s head and face can be described 
reasonably well using a few dozen parameters (Pighin, Hecker et al. 1998; Guenter, Grimm 
et al. 1998; DeCarlo, Metaxas, and Stone 1998; Blanz and Vetter 1999; Shan, Liu, and Zhang 
2001; Zollhöfer, Thies et al. 2018; Egger, Smith et al. 2020).

Figure 13.22 shows an example of an image-based modeling system, where user-specified 
keypoints in several images are used to fit a generic head model to a person's face. As you 
can see in Figure 13.22c, after specifying just over 100 keypoints, the shape of the face has 
become quite adapted and recognizable. Extracting a texture map from the original images 
and then applying it to the head model results in an animatable model with striking visual 
fidelity (Figure 13.23a). 

A more powerful system can be built by applying principal component analysis (PCA) to 
a collection of 3D scanned faces, which is a topic we discuss in Section 13.6.3. As you can 
see in Figure 13.25, it is then possible to fit morphable 3D models to single images and to 
use such models for a variety of animation and visual effects (Blanz and Vetter 1999; Egger, 
Smith et al. 2020). It is also possible to design stereo matching algorithms that optimize 
directly for the head model parameters (Shan, Liu, and Zhang 2001; Kang and Jones 2002) 
or to use the output of real-time stereo with active illumination (Zhang, Snavely et al. 2004) 
(Figures 13.10 and 13.23b). 

As the sophistication of 3D facial capture systems evolved, so did the detail and realism 
in the reconstructed models. Modern systems can capture (in real-time) not only surface 
details such as wrinkles and creases, but also accurate models of skin reflection, translucency, 
and sub-surface scattering (Debevec, Hawkins et al. 2000; Weyrich, Matusik et al. 2006; 
Golovinskiy, Matusik et al. 2006; Bickel, Botsch et al. 2007; Igarashi, Nishino, and Nayar 
2007; Meka, Haene et al. 2019). 

Once a 3D head model has been constructed, it can be used in a variety of applications, 
such as head tracking (Toyama 1998; Lepetit, Pilet, and Fua 2004; Matthews, Xiao, and 
Baker 2007), as shown in Figure 7.30, and face transfer, i.e., replacing one person’s face
with another in a video (Bregler, Covell, and Slaney 1997; Vlasic, Brand et al. 2005). Addi- 


tional applications include face beautification by warping face images toward a more attrac- 


10 https://all3dp.com/1/best-photogrammetry-software
11 https://www.pix4d.com
12 https://www.agisoft.com/
13 https://www.capturingreality.com/
14 https://canvas.io
Figure 13.23 Head and expression tracking and re-animation using deformable 3D models.
(a) Models fitted directly to five input video streams (Pighin, Szeliski, and Salesin 2002) ©
2002 Springer: The bottom row shows the results of re-animating a synthetic texture-mapped
3D model with pose and expression parameters fitted to the input images in the top row. (b)
Models fitted to frame-rate spacetime stereo surface models (Zhang, Snavely et al. 2004) ©
2004 ACM: The top row shows the input images with synthetic green markers overlaid, while
the bottom row shows the fitted 3D surface model.


tive “standard” (Leyvand, Cohen-Or et al. 2008), face de-identification for privacy protection 
(Gross, Sweeney et al. 2008), and face swapping (Bitouk, Kumar et al. 2008). 

More recent applications of 3D head models include photorealistic avatars for video con- 
ferencing (Chu, Ma et al. 2020), 3D unwarping for better selfies (Fried, Shechtman et al. 
2016; Zhao, Huang et al. 2019; Ma, Lin et al. 2020), and single image portrait relighting 
(Sun, Barron et al. 2019; Zhou, Hadap et al. 2019; Zhang, Barron et al. 2020), an example of 
which is shown in Figure 13.24. This last application is available as the Portrait Light feature 
in Google Photos.¹⁵ Additional applications can be found in the survey papers by Zollhöfer,
Thies et al. (2018) and Egger, Smith et al. (2020). 


13.6.3 Application: Facial animation 


Perhaps the most widely used application of 3D head modeling is facial animation (Zollhöfer,
Thies et al. 2018). Once a parameterized 3D model of the shape and appearance (surface
texture) of a person’s head has been constructed, it can be used directly to track a person’s 


facial motions (Figure 13.23a) and to animate a different character with these same motions 


15 https://blog.google/products/photos/new-helpful-editor
Figure 13.24 Portrait shadow removal and manipulation (Zhang, Barron et al. 2020) ©
2020 ACM. The top row shows the original photographs and the bottom row the corresponding
enhanced photographs after more flattering lighting has been simulated.


and expressions (Pighin, Szeliski, and Salesin 2002). 

An improved version of such a system can be constructed by first applying principal 
component analysis (PCA) to the space of possible head shapes and facial appearances. Blanz 
and Vetter (1999) describe a system where they first capture a set of 200 colored range scans 
of faces (Figure 13.25a), which can be represented as a large collection of (X,Y, Z, R, G, B) 
samples (vertices).¹⁶ For 3D morphing to be meaningful, corresponding vertices in different
people’s scans must first be put into correspondence (Pighin, Hecker et al. 1998). Once 
this is done, PCA can be applied to more naturally parameterize the 3D morphable model. 
The flexibility of this model can be increased by performing separate analyses in different 
subregions, such as the eyes, nose, and mouth, just as in modular eigenspaces (Moghaddam 
and Pentland 1997). 
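
The PCA step can be sketched in a few lines of NumPy. The example assumes that each scan has already been put into vertex correspondence and flattened into one row vector of stacked (X, Y, Z, R, G, B) values. Treating shape and color in a single joint vector is a simplification made here for brevity, and the function names are hypothetical.

```python
import numpy as np

def build_morphable_model(scans, num_components):
    """PCA of registered scans.

    scans: (num_faces, num_vertices * 6) array, one flattened scan per row.
    Returns the mean face, an orthonormal basis (num_components, D), and the
    standard deviation of the data along each basis direction.
    """
    mean = scans.mean(axis=0)
    centered = scans - mean
    # Economy-size SVD: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_components]
    stddev = singular_values[:num_components] / np.sqrt(len(scans) - 1)
    return mean, basis, stddev

def project(face, mean, basis):
    """Coefficients of a face in the PCA subspace."""
    return basis @ (face - mean)

def synthesize(coeffs, mean, basis):
    """Face generated from subspace coefficients."""
    return mean + coeffs @ basis

# Synthetic stand-in for 200 registered scans with 10 vertices each.
rng = np.random.default_rng(1)
scans = 5.0 + rng.normal(size=(200, 3)) @ rng.normal(size=(3, 60))
mean, basis, stddev = build_morphable_model(scans, num_components=3)
reconstruction = synthesize(project(scans[0], mean, basis), mean, basis)
```

Performing separate analyses in subregions, as described above, amounts to running the same procedure on the subset of columns belonging to each region.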

After computing a subspace representation, different directions in this space can be as- 
sociated with different characteristics such as gender, facial expressions, or facial features 
(Figure 13.25a). As in the work of Rowland and Perrett (1995), faces can be turned into 
caricatures by exaggerating their displacement from the mean image. 
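
Both operations reduce to simple vector arithmetic in the model space. The sketch below assumes faces are stored as flat vectors. Estimating an attribute direction as the difference between two class means is one simple choice made here for illustration and is not necessarily the procedure used in the cited papers.

```python
import numpy as np

def caricature(face, mean_face, exaggeration=1.5):
    """Scale a face's displacement from the mean (1.0 leaves it unchanged)."""
    return mean_face + exaggeration * (face - mean_face)

def attribute_direction(faces, labels):
    """Direction from the mean of faces labeled 0 to the mean of faces labeled 1."""
    faces, labels = np.asarray(faces, dtype=float), np.asarray(labels)
    return faces[labels == 1].mean(axis=0) - faces[labels == 0].mean(axis=0)

def move_along_attribute(face, direction, amount):
    """Shift a face along an attribute direction, e.g., toward a smile."""
    return face + amount * direction

face = np.array([2.0, 0.0])
mean_face = np.array([1.0, 1.0])
exaggerated = caricature(face, mean_face)   # array([ 2.5, -0.5])
```

An exaggeration factor between 0 and 1 has the opposite effect, pulling the face toward the mean.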

3D morphable models can be fitted to a single image using gradient descent on the error 
between the input image and the re-synthesized model image, after an initial manual place- 
ment of the model in an approximately correct pose, scale, and location (Figures 13.25b-c). 


The efficiency of this fitting process can be increased using inverse compositional image 


16 A cylindrical coordinate system provides a natural two-dimensional embedding for this collection, but such an
embedding is not necessary to perform PCA.
Figure 13.25 3D morphable face model (Blanz and Vetter 1999) © 1999 ACM: (a) original
3D face model with the addition of shape and texture variations in specific directions:
deviation from the mean (caricature), gender, expression, weight, and nose shape; (b) a 3D
morphable model is fitted to a single image, after which its weight or expression can be
manipulated; (c) another example of a 3D reconstruction along with a different set of 3D
manipulations, such as lighting and pose change.

Figure 13.26 A timeline of twenty years of 3D morphable head models (Egger, Smith et al.
2020) © 2020 ACM, including results from the original paper by Blanz and Vetter (1999), the
first publicly available morphable model (Paysan, Knothe et al. 2009), facial re-enactment
results (Kim, Garrido et al. 2018), and GAN-based models (Gecer, Ploumpis et al. 2019).


alignment (Baker and Matthews 2004) as described by Romdhani and Vetter (2003). 
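
The structure of the fitting loop can be illustrated with a deliberately simplified sketch. A real system re-synthesizes an image through pose, camera, and lighting parameters, which makes the problem non-linear. Here we assume the "rendering" is just the linear model itself, so that the error is a least-squares function of the coefficients. The function name, the optional Gaussian prior term, and the step size are assumptions chosen for the example.

```python
import numpy as np

def fit_coefficients(target, mean, basis, stddev, prior_weight=0.0,
                     step=0.25, iterations=200):
    """Gradient descent on ||mean + c @ basis - target||^2 + prior.

    basis:  (K, D) orthonormal rows, e.g., from PCA.
    stddev: (K,) per-coefficient standard deviations used by the prior
            prior_weight * sum((c / stddev)^2), which favors plausible faces.
    The fixed step of 0.25 assumes orthonormal basis rows and a weak prior;
    a strong prior with small stddev values would need a smaller step.
    """
    coeffs = np.zeros(basis.shape[0])
    for _ in range(iterations):
        residual = mean + coeffs @ basis - target       # synthesis error
        gradient = 2.0 * basis @ residual + 2.0 * prior_weight * coeffs / stddev**2
        coeffs -= step * gradient
    return coeffs
```

Because this toy objective is quadratic, it could also be solved in closed form. The iterative version is shown because it mirrors the structure needed once the synthesis step becomes non-linear.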

The resulting texture-mapped 3D model can then be modified to produce a variety of vi- 
sual effects, including changing a person’s weight or expression, or three-dimensional effects 
such as re-lighting or 3D video-based animation (Section 14.5.1). Such models can also be 
used for video compression, e.g., by only transmitting a small number of facial expression 
and pose parameters to drive a synthetic avatar (Eisert, Wiegand, and Girod 2000; Gao, Chen 
et al. 2003; Lombardi, Saragih et al. 2018; Wei, Saragih et al. 2019) or to bring a still portrait 
image to life (Averbuch-Elor, Cohen-Or et al. 2017). The survey paper on 3D morphable 
face models by Egger, Smith et al. (2020) (Figure 13.26) discusses additional research and 
applications in this area. 

3D facial animation is often matched to the performance of an actor, in what is known 
as performance-driven animation (Section 7.1.6) (Williams 1990). Traditional performance- 
driven animation systems use marker-based motion capture (Ma, Jones et al. 2008), while 
some newer systems use depth cameras or regular video to control the animation (Buck, 
Finkelstein et al. 2000; Pighin, Szeliski, and Salesin 2002; Zhang, Snavely et al. 2004; Vlasic, 
Brand et al. 2005; Weise, Bouaziz et al. 2011; Thies, Zollhöfer et al. 2016; Thies, Zollhöfer
et al. 2018). 

An example of the latter approach is the system developed for the film The Curious Case 
of Benjamin Button, in which Digital Domain used the CONTOUR system from Mova¹⁷ to
capture actor Brad Pitt’s facial motions and expressions (Roble and Zafar 2009). CONTOUR 
uses a combination of phosphorescent paint and multiple high-resolution video cameras to 
capture real-time 3D range scans of the actor. These 3D models were then translated into 
Facial Action Coding System (FACS) shape and expression parameters (Ekman and Friesen 
1978) to drive a different (older) synthetically animated computer-generated imagery (CGI) 


17 http://www.mova.com.
character. More recent examples of performance-driven facial animation can be found in the
state-of-the-art report by Zollhöfer, Thies et al. (2018).


13.6.4 Human body modeling and tracking 


The topics of tracking humans, modeling their shape and appearance, and recognizing their 
activities are some of the most actively studied areas of computer vision. Annual conferences¹⁸
and special journal issues (Hilton, Fua, and Ronfard 2006) are devoted to this subject,
and two surveys (Forsyth, Arikan et al. 2006; Moeslund, Hilton, and Krüger 2006) each
list over 400 papers devoted to these topics.¹⁹ The HumanEva database of articulated human
motions contains multi-view video sequences of human actions along with corresponding 
motion capture data, evaluation code, and a reference 3D tracker based on particle filtering. 
The companion paper by Sigal, Balan, and Black (2010) not only describes the database 
and evaluation but also has a nice survey of important work in this field. The more recent 
MPI FAUST dataset (Bogo, Romero et al. 2014) has 300 real, high-resolution human scans 
with automatically computed ground-truth correspondences, while the even newer AMASS 
dataset (Mahmood, Ghorbani et al. 2019) has more than 40 hours of motion data, spanning
over 300 subjects and 11,000 motions.²⁰

Given the breadth of this area, it is difficult to categorize all of this research, especially as 
different techniques usually build on each other. Moeslund, Hilton, and Krüger (2006) divide
their survey into initialization, tracking (which includes background modeling and segmenta- 
tion), pose estimation, and action (activity) recognition. Forsyth, Arikan et al. (2006) divide 
their survey into sections on tracking (background subtraction, deformable templates, flow, 
and probabilistic models), recovering 3D pose from 2D observations, and data association 
and body parts. They also include a section on motion synthesis, which is more widely stud- 
ied in computer graphics (Arikan and Forsyth 2002; Kovar, Gleicher, and Pighin 2002; Lee, 
Chai et al. 2002; Li, Wang, and Shum 2002; Pullen and Bregler 2002): see Section 14.5.2. 
Another potential taxonomy for work in this field would be along the lines of whether 2D 
or 3D (or multi-view) images are used as input and whether 2D or 3D kinematic models are 
used. 

In this section, we briefly review some of the more seminal and widely cited papers in the 


areas of background subtraction, initialization and detection, tracking with flow, 3D kinematic 


18 International Conference on Automatic Face and Gesture Recognition (FG) and IEEE Workshop on Analysis
and Modeling of Faces and Gestures (AMFG).

19 Older surveys include those by Gavrila (1999) and Moeslund and Granum (2001). Some surveys on gesture
recognition, which we do not cover in this book, include those by Pavlović, Sharma, and Huang (1997) and Yang,
Ahuja, and Tabb (2002).

20 Additional datasets from the MPI Perceiving Systems group can be found at https://ps.is.mpg.de/code. 
models, probabilistic models, adaptive shape modeling, and activity recognition. We refer the
reader to the previously mentioned surveys for other topics and more details.


Background subtraction. One of the first steps in many human tracking systems is to 
model the background to extract the moving foreground objects (silhouettes) corresponding 
to people. Toyama, Krumm et al. (1999) review several difference matting and background 
maintenance (modeling) techniques and provide a good introduction to this topic. Stauffer 
and Grimson (1999) describe some techniques based on mixture models, while Sidenbladh 
and Black (2003) develop a more comprehensive treatment, which models not only the back- 
ground image statistics but also the appearance of the foreground objects, e.g., their edge and 
motion (frame difference) statistics. More recent techniques for video background matting, 
such as those of Sengupta, Jayaram et al. (2020) and Lin, Ryabtsev et al. (2021), are discussed
in Section 10.4.5 on video matting. 
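
As a minimal illustration of background maintenance, the sketch below keeps a running mean and variance for every pixel and flags pixels that deviate strongly from this model. It uses a single Gaussian per pixel, which is a simplification of the mixture models of Stauffer and Grimson (1999). The class name and all parameter values are assumptions chosen for the example.

```python
import numpy as np

class RunningGaussianBackground:
    """Per-pixel single-Gaussian background model for grayscale frames."""

    def __init__(self, first_frame, learning_rate=0.05, threshold=2.5, min_var=4.0):
        self.mean = np.asarray(first_frame, dtype=np.float64).copy()
        self.var = np.full(self.mean.shape, 4.0 * min_var)  # generous initial variance
        self.learning_rate = learning_rate
        self.threshold = threshold      # foreground if |diff| > threshold * std
        self.min_var = min_var          # floor that keeps the model from over-tightening

    def apply(self, frame):
        """Return a boolean foreground mask and update the background model."""
        diff = np.asarray(frame, dtype=np.float64) - self.mean
        foreground = diff**2 > self.threshold**2 * self.var
        # Update only pixels classified as background, so that foreground
        # objects are not absorbed into the model immediately.
        rate = np.where(foreground, 0.0, self.learning_rate)
        self.mean += rate * diff
        self.var = np.maximum((1.0 - rate) * self.var + rate * diff**2, self.min_var)
        return foreground
```

Calling `apply` on each incoming frame yields a silhouette mask. In practice, such masks are usually cleaned up with morphological operations or connected-component analysis before being used for tracking.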

Once silhouettes have been extracted from one or more cameras, they can then be mod- 
eled using deformable templates or other contour models (Baumberg and Hogg 1996; Wren, 
Azarbayejani et al. 1997). Tracking such silhouettes over time supports the analysis of multi- 
ple people moving around a scene, including building shape and appearance models and de- 
tecting if they are carrying objects (Haritaoglu, Harwood, and Davis 2000; Mittal and Davis 
2003; Dimitrijevic, Lepetit, and Fua 2006). 


Initialization and detection. To track people in a fully automated manner, it is necessary to 
first detect (or re-acquire) their presence in individual video frames. This topic is closely re- 
lated to pedestrian detection, which is often considered as a kind of object recognition (Mori, 
Ren et al. 2004; Felzenszwalb and Huttenlocher 2005; Felzenszwalb, McAllester, and Ramanan
2008; Dollár, Wojek et al. 2012; Dollár, Appel et al. 2014; Sermanet, Kavukcuoglu et
al. 2013; Ouyang and Wang 2013; Tian, Luo et al. 2015; Zhang, Lin et al. 2016), and is there- 
fore treated in more depth in Section 6.3.2. Additional techniques for initializing 3D trackers 
based on 2D images include those described by Howe, Leventon, and Freeman (2000), Ros- 
ales and Sclaroff (2000), Shakhnarovich, Viola, and Darrell (2003), Sminchisescu, Kanaujia 
et al. (2005), Agarwal and Triggs (2006), Lee and Cohen (2006), Sigal and Black (2006b), 
and Stenger, Thayananthan et al. (2006). 

Single-frame human detection and pose estimation algorithms can be used by themselves 
to perform tracking (Ramanan, Forsyth, and Zisserman 2005; Rogez, Rihan et al. 2008; Bourdev
and Malik 2009; Güler, Neverova, and Kokkinos 2018; Cao, Hidalgo et al. 2019), as
described in Section 6.3.2 (Figure 6.25) and Section 6.4.5 (Figures 6.42–6.43). They are often
combined, however, with frame-to-frame tracking techniques to provide better reliability
Figure 13.27 Tracking 3D human motion: (a) kinematic chain model for a human hand
(Rehg, Morris, and Kanade 2003) © 2003, reprinted by permission of SAGE; (b) tracking a
kinematic chain blob model in a video sequence (Bregler, Malik, and Pullen 2004) © 2004
Springer; (c-d) probabilistic loose-limbed collection of body parts (Sigal, Bhatia et al. 2004)
© 2004 IEEE.


(Fossati, Dimitrijevic et al. 2007; Andriluka, Roth, and Schiele 2008; Ferrari, Marin-Jimenez, 
and Zisserman 2008). 


Tracking with flow. The tracking of people and their pose from frame to frame can be 
enhanced by computing optical flow or matching the appearance of their limbs from one 
frame to another. For example, the cardboard people model of Ju, Black, and Yacoob (1996) 
models the appearance of each leg portion (upper and lower) as a moving rectangle, and uses 
optical flow to estimate their location in each subsequent frame. Cham and Rehg (1999) 
and Sidenbladh, Black, and Fleet (2000) track limbs using optical flow and templates, along 
with techniques for dealing with multiple hypotheses and uncertainty. Bregler, Malik, and 
Pullen (2004) use a full 3D model of limb and body motion, as described below. It is also 
possible to match the estimated motion field itself to some prototypes in order to identify 
the particular phase of a running motion or to match two low-resolution video portions to 
perform video replacement (Efros, Berg et al. 2003). Flow-based tracking can also be used to 
track non-rigidly deforming objects such as T-shirts (White, Crane, and Forsyth 2007; Pilet, 
Lepetit, and Fua 2008; Furukawa and Ponce 2008; Salzmann and Fua 2010; Božič, Zollhöfer
et al. 2020; Božič, Palafox et al. 2020, 2021). It is also possible to use inter-frame motion
to estimate an evolving textured 3D mesh model of a moving person (de Aguiar, Stoll et al. 
2008). 


3D kinematic models. The effectiveness of human modeling and tracking can be greatly 
enhanced using a more accurate 3D model of a person’s shape and motion. Underlying such 


representations, which are ubiquitous in 3D computer animation in games and special effects, 
is a kinematic model or kinematic chain, which specifies the length of each limb in a skeleton
as well as the 2D or 3D rotation angles between the limbs or segments (Figure 13.27a-b). 
Inferring the values of the joint angles from the locations of the visible surface points is 
called inverse kinematics (IK) and is widely studied in computer graphics. 
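
A planar chain makes these definitions concrete. In the sketch below, forward kinematics maps limb lengths and relative joint angles to joint positions, and inverse kinematics recovers angles that place the end of the chain at a target point using damped least-squares updates. This is one common IK scheme chosen for illustration. A tracker would instead minimize an error against image measurements, and all function names are hypothetical.

```python
import numpy as np

def forward_kinematics(lengths, angles):
    """Joint positions, shape (n + 1, 2), of a planar chain rooted at the origin.

    angles[i] is the rotation of limb i relative to limb i - 1.
    """
    lengths = np.asarray(lengths, dtype=float)
    absolute = np.cumsum(angles)                 # orientation of each limb
    steps = np.stack([lengths * np.cos(absolute),
                      lengths * np.sin(absolute)], axis=1)
    return np.vstack([np.zeros(2), np.cumsum(steps, axis=0)])

def end_point_jacobian(lengths, angles):
    """2 x n derivative of the chain's end point with respect to the joint angles."""
    joints = forward_kinematics(lengths, angles)
    rel = joints[-1] - joints[:-1]
    # Rotating joint i moves the end point perpendicular to (end - joint_i).
    return np.stack([-rel[:, 1], rel[:, 0]], axis=0)

def inverse_kinematics(lengths, target, initial_angles, damping=1e-2, iterations=100):
    """Damped least-squares IK for the end point; needs a reasonable initial guess."""
    angles = np.array(initial_angles, dtype=float)
    for _ in range(iterations):
        error = np.asarray(target, dtype=float) - forward_kinematics(lengths, angles)[-1]
        J = end_point_jacobian(lengths, angles)
        angles += J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), error)
    return angles

# A two-limb "arm" with unit-length limbs and a right-angle elbow.
joints = forward_kinematics([1.0, 1.0], [0.0, np.pi / 2])   # ends at (1, 1)
angles = inverse_kinematics([1.0, 1.0], target=[1.0, 1.0], initial_angles=[0.1, 1.3])
```

The solution is generally not unique (the elbow can bend either way), which is one source of the pose ambiguities that the kinematic trackers discussed in this section must handle.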

Figure 13.27a shows the kinematic model for a human hand used by Rehg, Morris, and 
Kanade (2003) to track hand motion in a video. As you can see, the attachment points between
the fingers and the thumb have two degrees of freedom, while the finger joints themselves 
have only one. Using this kind of model can greatly enhance the ability of an edge-based 
tracker to cope with rapid motion, ambiguities in 3D pose, and partial occlusions. 

One of the biggest advances in reliable real-time hand tracking and modeling was the 
introduction of the Kinect consumer RGB-D camera (Sharp, Keskin et al. 2015; Taylor, Bordeaux
et al. 2016). Since then, regular RGB tracking and modeling has also improved significantly,
with newer techniques using neural networks for reliability and speed (Zimmermann
and Brox 2017; Mueller, Bernard et al. 2018; Hasson, Varol et al. 2019; Shan, Geng et al. 
2020; Moon, Shiratori, and Lee 2020; Moon, Yu et al. 2020; Spurr, Iqbal et al. 2020; Taheri, 
Ghorbani et al. 2020). Several systems also combine body and hand tracking to more ac- 
curately capture human expressions and activities (Romero, Tzionas, and Black 2017; Joo, 
Simon, and Sheikh 2018; Pavlakos, Choutas et al. 2019; Rong, Shiratori, and Joo 2020). 

In addition to hands, kinematic chain models are even more widely used for whole body 
modeling and tracking (O’Rourke and Badler 1980; Hogg 1983; Rohr 1994). One popular
approach is to associate an ellipsoid or superquadric with each rigid limb in the kinematic 
model, as shown in Figure 13.27b. This model can then be fitted to each frame in one or 
more video streams either by matching silhouettes extracted from known backgrounds or by 
matching and tracking the locations of occluding edges (Gavrila and Davis 1996; Kakadiaris 
and Metaxas 2000; Bregler, Malik, and Pullen 2004; Kehl and Van Gool 2006). 

One of the big breakthroughs in real-time skeletal tracking was the introduction of the 
Kinect consumer depth camera for interactive video game control (Shotton, Fitzgibbon et al. 
2011; Taylor, Shotton et al. 2012; Shotton, Girshick et al. 2013) as shown in Figure 13.28. 
In the current landscape of skeletal tracking, some techniques use 2D models coupled to 2D 
measurements, some use 3D measurements (range data or multi-view video) with 3D models 
(Baak, Mueller et al. 2011), and some use monocular video to infer and track 3D models 
directly (Mehta, Sridhar et al. 2017; Habermann, Xu et al. 2019). 

It is also possible to use temporal models to improve the tracking of periodic motions, 
such as walking, by analyzing the joint angles as functions of time (Polana and Nelson 1997; 
Seitz and Dyer 1997; Cutler and Davis 2000). The generality and applicability of such tech- 


niques can be improved by learning typical motion patterns using principal component anal- 
[Figure panels: inferred body parts; hypothesized joints; tracked skeleton]

Figure 13.28 The Kinect skeletal tracking pipeline, which consists of per-pixel body-part
classification, body joint hypotheses, and then mapping to a skeleton using temporal continuity
and prior knowledge (Shotton, Girshick et al. 2013). This figure is taken from Zhang
(2012) © 2012 IEEE.


ysis (Sidenbladh, Black, and Fleet 2000; Urtasun, Fleet, and Fua 2006). 


Probabilistic models. Because tracking can be such a difficult task, sophisticated proba- 
bilistic inference techniques are often used to estimate the likely states of the person being 
tracked. One popular approach, called particle filtering (Isard and Blake 1998), was origi- 
nally developed for tracking the outlines of people and hands, as described in Section 7.3.1. It 
was subsequently applied to whole-body tracking (Deutscher, Blake, and Reid 2000; Siden- 
bladh, Black, and Fleet 2000; Deutscher and Reid 2005) and continues to be used in modern 
trackers (Ong, Micilotta et al. 2006). Alternative approaches to handling the uncertainty in- 
herent in tracking include multiple hypothesis tracking (Cham and Rehg 1999) and inflated 
covariances (Sminchisescu and Triggs 2001). 
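
The core loop of a particle filter is short enough to sketch directly. The example tracks a single scalar state, such as one joint angle, and assumes random-walk dynamics and a Gaussian observation likelihood. Both models and all parameter values are illustrative assumptions. Real trackers use high-dimensional pose states and image-based likelihoods.

```python
import numpy as np

def particle_filter_step(particles, weights, observation, rng,
                         process_std=0.1, obs_std=0.2):
    """One resample / predict / measure cycle for a 1D state.

    particles: (N,) state hypotheses; weights: (N,) non-negative, summing to one.
    """
    n = len(particles)
    # 1. Resample hypotheses in proportion to their previous weights.
    particles = particles[rng.choice(n, size=n, p=weights)]
    # 2. Predict: diffuse each hypothesis with the (assumed) random-walk dynamics.
    particles = particles + rng.normal(scale=process_std, size=n)
    # 3. Measure: reweight by the (assumed) Gaussian observation likelihood.
    log_w = -0.5 * ((observation - particles) / obs_std) ** 2
    weights = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return particles, weights / weights.sum()

def estimate(particles, weights):
    """Posterior mean of the state."""
    return float(weights @ particles)

rng = np.random.default_rng(0)
particles = rng.uniform(-3.0, 3.0, size=500)   # broad initial uncertainty
weights = np.full(500, 1.0 / 500)
for _ in range(20):
    observation = 1.0 + rng.normal(scale=0.05)  # noisy measurements of a true state of 1.0
    particles, weights = particle_filter_step(particles, weights, observation, rng)
```

Because the posterior is represented by samples rather than a single Gaussian, the filter can maintain several competing hypotheses at once, which is useful when limbs are occluded or poses are ambiguous.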

Figure 13.27c-d shows an example of a sophisticated spatio-temporal probabilistic graph- 
ical model called loose-limbed people, which models not only the geometric relationship be- 
tween various limbs, but also their likely temporal dynamics (Sigal, Bhatia et al. 2004). The 
conditional probabilities relating various limbs and time instances are learned from training 


data, and particle filtering is used to perform the final pose inference. 


Adaptive shape modeling. Another essential component of whole body modeling and 
tracking is the fitting of parameterized shape models to visual data. As we saw in Sec- 
tion 13.6.3 (Figure 13.25), the availability of large numbers of registered 3D range scans can 
be used to create morphable models of shape and appearance (Allen, Curless, and Popović 
2003). Building on this work, Anguelov, Srinivasan et al. (2005) develop a sophisticated 
system called SCAPE (Shape Completion and Animation for PEople), which first acquires 


a large number of range scans of different people in varied poses, and then registers these 
Figure 13.29 Estimating human shape and pose from a single image using a parametric
3D model (Guan, Weiss et al. 2009) © 2009 IEEE.


scans using semi-automated marker placement. The registered datasets are used to model the 
variation in shape as a function of personal characteristics and skeletal pose, e.g., the bulging 
of muscles as certain joints are flexed (Figure 13.29, top row). The resulting system can then 
be used for shape completion, i.e., the recovery of a full 3D mesh model from a small number
of captured markers, by finding the best model parameters in both shape and pose space that 
fit the measured data. 

Because it is constructed completely from scans of people in close-fitting clothing and 
uses a parametric shape model, the SCAPE system cannot cope with people wearing
loose-fitting clothing. Bălan and Black (2008) overcome this limitation by estimating the body
shape that fits within the visual hull of the same person observed in multiple poses, while 
Vlasic, Baran et al. (2008) adapt an initial surface mesh fitted with a parametric shape model 
to better match the visual hull. 

While the preceding body fitting and pose estimation systems use multiple views to es- 
timate body shape, Guan, Weiss et al. (2009) fit a human shape and pose model to a single 
image of a person on a natural background. Manual initialization is used to estimate a rough 
pose (skeleton) and height model, and this is then used to segment the person’s outline using 
the GrabCut segmentation algorithm (Section 4.3.2). The shape and pose estimates are then
refined using a combination of silhouette edge cues and shading information (Figure 13.29). 


The resulting 3D model can be used to create novel animations. 
Figure 13.30 Whole body, expression, and gesture fitting from a single image using the
SMPL-X model from Pavlakos, Choutas et al. (2019) © 2019 IEEE: (a) estimating the major
joints, skeleton, SMPL, and SMPL-X models from a single image; (b) qualitative results of
SMPL-X for some in-the-wild images.


While some of the original work on 3D body and pose fitting was done using the SCAPE 
and BlendSCAPE (Hirshberg, Loper et al. 2012) models, the Skinned Multi-Person Linear 
model (SMPL) developed by Loper, Mahmood et al. (2015) introduced a skinned vertex- 
based model that accurately represents a wide variety of body shapes in natural human 
poses. The model consists of a rest pose template, pose-dependent blend shapes, and identity- 
dependent blend shapes, and is built by training on a large collection of aligned 3D human 
scans. Bogo, Kanazawa et al. (2016) show how the parameters of this 3D model can be 
estimated from just a single image using their SMPLify method.

In subsequent work, Romero, Tzionas, and Black (2017) extend this model by adding a
hand Model with Articulated and Non-rigid defOrmations (MANO). Joo, Simon, and Sheikh
(2018) stitch together the SMPL body model with a face and a hand model to create the 3D
Frank and Adam models that can track multiple people in a social setting. Pavlakos,
Choutas et al. (2019) use thousands of 3D scans to train a new, unified, 3D model of the 
human body (SMPL-X) that extends SMPL with gender-specific models and includes fully 
articulated hands and an expressive face, as shown in Figure 13.30. They also replace the 
mixture of Gaussians prior in SMPL with a variational autoencoder (VAE) and develop a 
new VPoser prior trained on the large-scale AMASS motion capture dataset collected by 
Mahmood, Ghorbani et al. (2019). 

In more recent work, Kocabas, Athanasiou, and Black (2020) introduce VIBE, a system 
for video inference of human body pose and shape that makes use of AMASS. Choutas, 
Pavlakos et al. (2020) develop a system they call ExPose (EXpressive POse and Shape rEgression),
which directly regresses the body, face, and hands SMPL-X parameters from an RGB image.
The more recent STAR (Sparse Trained Articulated human body Regressor) model (Osman,
Bolkart, and Black 2020) has many fewer parameters than SMPL and removes spurious
long-range correlations between vertices. It also includes shape-dependent pose-corrective
blend shapes that depend on both body pose and BMI, and it models a much wider range of
variation in the human population because it is trained on an additional 10,000 scans of
male and female subjects. GHUM and GHUML (Xu, Bazavan et al. 2020) rely on non-linear
shape spaces constructed from deep variational autoencoders for body and facial deformation
and on normalizing flow representations for skeleton (body and hand) kinematics. Recent
papers that continue to improve the accuracy and speed of single-image model
fitting on the challenging 3D Poses in the Wild (3DPW) benchmark and dataset (von Marcard, 
Henschel et al. 2018) include Song, Chen, and Hilliges (2020), Joo, Neverova, and Vedaldi 
(2020), and Rong, Shiratori, and Joo (2020). 


Activity recognition. The final widely studied topic in human modeling is motion, activity, 
and action recognition (Bobick 1997; Hu, Tan et al. 2004; Hilton, Fua, and Ronfard 2006). 
Examples of actions that are commonly recognized include walking and running, jumping, 
dancing, picking up objects, sitting down and standing up, and waving. Papers on these topics 
include Robertson and Reid (2006), Sminchisescu, Kanaujia, and Metaxas (2006), Weinland, 
Ronfard, and Boyer (2006), Yilmaz and Shah (2006), and Gorelick, Blank et al. (2007), as 
well as more recent video understanding papers such as the ones we covered in Section 6.5, 
e.g., Carreira and Zisserman (2017), Tran, Wang et al. (2018), Tran, Wang et al. (2019), Wu, 
Feichtenhofer et al. (2019), and Feichtenhofer, Fan et al. (2019). 


13.7 Recovering texture maps and albedos 


After a 3D model of an object or person has been acquired, the final step in modeling is 
usually to recover a texture map to describe the object’s surface appearance. This first requires 
establishing a parameterization for the (u, v) texture coordinates as a function of 3D surface 
position.²¹ One simple way to do this is to associate a separate texture map with each triangle
(or pair of triangles). More space-efficient techniques involve unwrapping the surface onto 
one or more maps, e.g., using a subdivision mesh (Section 13.3.2) (Eck, DeRose et al. 1995) 
or a geometry image (Section 13.3.3) (Gu, Gortler, and Hoppe 2002). 

21 Although a few recent papers have directly constructed a mapping from $(x, y, z)$ to color values (Saito, Huang
et al. 2019; Saito, Simon et al. 2020; Mildenhall, Srinivasan et al. 2020); see Section 14.6.

Once the $(u, v)$ coordinates for each triangle have been fixed, the perspective projection
equations mapping from texture $(u, v)$ to an image $j$'s pixel $(u_j, v_j)$ coordinates can be
obtained by concatenating the affine $(u, v) \rightarrow (X, Y, Z)$ mapping with the perspective
homography $(X, Y, Z) \rightarrow (u_j, v_j)$ (Szeliski and Shum 1997). The color values for the $(u, v)$
texture map can then be re-sampled and stored, or the original image can itself be used as the
texture source using projective texture mapping (OpenGL-ARB 1997).
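The concatenation described above can be illustrated with a minimal sketch. It assumes a single triangle whose three vertices have known texture coordinates and 3D positions, and a known 3×4 camera matrix `P` for image $j$; the function names and the NumPy-based formulation are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def texture_to_image_homography(tex_uv, verts_xyz, P):
    """Return the 3x3 map from homogeneous texture (u, v, 1) to image pixels.

    tex_uv:    three (u, v) texture coordinates, one per triangle vertex
    verts_xyz: three (X, Y, Z) vertex positions
    P:         3x4 perspective camera matrix of image j (assumed known)
    """
    T = np.vstack([np.asarray(tex_uv, dtype=float).T, np.ones(3)])     # 3x3, columns (u_i, v_i, 1)
    X = np.vstack([np.asarray(verts_xyz, dtype=float).T, np.ones(3)])  # 4x3, columns (X_i, Y_i, Z_i, 1)
    A = X @ np.linalg.inv(T)       # 4x3 affine map: (u, v, 1) -> (X, Y, Z, 1)
    return np.asarray(P, dtype=float) @ A   # concatenate with the projection

def apply_homography(H, uv):
    """Map one texture coordinate to pixel coordinates (u_j, v_j)."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]
```

Because the map is a single 3×3 matrix per triangle, re-sampling a texture map amounts to evaluating `apply_homography` at each texel center and looking up the source image there.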

The situation becomes more involved when more than one source image is available for 
appearance recovery, which is the usual case. One possibility is to use a view-dependent 
texture map (Section 14.1.1), in which a different source image (or combination of source 
images) is used for each polygonal face based on the angles between the virtual camera, the 
surface normals, and the source images (Debevec, Taylor, and Malik 1996; Pighin, Hecker 
et al. 1998). An alternative approach is to estimate a complete Surface Light Field for each 
surface point (Wood, Azuma et al. 2000), as described in Section 14.3.2. 

In some situations, e.g., when using models in traditional 3D games, it is preferable to 
merge all of the source images into a single coherent texture map during pre-processing 
(Weinhaus and Devarajan 1997). Ideally, each surface triangle should select the source image 
where it is seen most directly (perpendicular to its normal) and at the resolution best matching 
the texture map resolution.²² This can be posed as a graph cut optimization problem, where
the smoothness term encourages adjacent triangles to use similar source images, followed by
blending to compensate for exposure differences (Lempitsky and Ivanov 2007; Sinha, Steedly
et al. 2008). Even better results can be obtained by explicitly modeling geometric and photometric
misalignments between the source images (Shum and Szeliski 2000; Gal, Wexler
et al. 2010; Waechter, Moehrle, and Goesele 2014; Zhou and Koltun 2014; Huang, Dai et
al. 2017; Fu, Yan et al. 2018; Schöps, Sattler, and Pollefeys 2019b; Lee, Ha et al. 2020).
“Neural” texture map representations can also be used as an alternative to RGB color fields
(Oechsle, Mescheder et al. 2019; Mihajlovic, Weder et al. 2021). Zollhöfer, Stotko et al.
(2018, Section 4.1) discuss related techniques in more detail.
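As a rough illustration of the per-triangle selection criterion, the sketch below scores each source camera by how directly it views a triangle and picks the best one independently for every triangle. This covers only the data term: the smoothness term of the graph cut formulation, the resolution criterion, occlusion tests, and the final blending are all omitted, and the function names are hypothetical.

```python
import math

def view_score(centroid, normal, cam_center):
    """Cosine of the angle between the surface normal and the direction to the camera."""
    d = [c - p for c, p in zip(cam_center, centroid)]
    d_len = math.sqrt(sum(x * x for x in d))
    n_len = math.sqrt(sum(x * x for x in normal))
    return sum(a * b for a, b in zip(d, normal)) / (d_len * n_len)

def select_source_images(centroids, normals, cam_centers):
    """Greedy labeling: for each triangle, the index of the most frontal camera."""
    labels = []
    for p, n in zip(centroids, normals):
        scores = [view_score(p, n, c) for c in cam_centers]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels
```

In a full system these scores would serve as unary costs, with a pairwise penalty added whenever neighboring triangles receive different labels.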

These kinds of approaches produce good results when the lighting stays fixed with respect 
to the object, i.e., when the camera moves around the object or space. When the lighting is 
strongly directional, however, and the object is being moved relative to this lighting, strong 
shading effects or specularities may be present, which will interfere with the reliable recov- 
ery of a texture (albedo) map. In this case, it is preferable to explicitly undo the shading 
effects (Section 13.1) by modeling the light source directions and estimating the surface re- 
flectance properties while recovering the texture map (Sato and Ikeuchi 1996; Sato, Wheeler, 
and Ikeuchi 1997; Yu and Malik 1998; Yu, Debevec et al. 1999). Figure 13.31 shows the
results of one such approach, where the specularities are first removed while estimating the
matte reflectance component (albedo) and then later re-introduced by estimating the specular
component $k_s$ in a Torrance-Sparrow reflection model (2.92).

22 When surfaces are seen at oblique viewing angles, it may be necessary to blend different images together to
obtain the best resolution (Wang, Kang et al. 2001).

Figure 13.31 Estimating the diffuse albedo and reflectance parameters for a scanned 3D
model (Sato, Wheeler, and Ikeuchi 1997) © 1997 ACM: (a) set of input images projected
onto the model; (b) the complete diffuse reflection (albedo) model; (c) rendering from the
reflectance model including the specular component.


13.7.1 Estimating BRDFs 


A more ambitious approach to the problem of view-dependent appearance modeling is to 
estimate a general bidirectional reflectance distribution function (BRDF) for each point on 
an object’s surface. Dana, van Ginneken et al. (1999), Jensen, Marschner et al. (2001), and 
Lensch, Kautz et al. (2003) present different techniques for estimating such functions, while 
Dorsey, Rushmeier, and Sillion (2007) and Weyrich, Lawrence et al. (2009) provide surveys 
of the topics of BRDF modeling, recovery, and rendering. 

As we saw in Section 2.2.2 (2.82), the BRDF can be written as 


$$f_r(\theta_i, \phi_i, \theta_r, \phi_r; \lambda), \tag{13.6}$$

where $(\theta_i, \phi_i)$ and $(\theta_r, \phi_r)$ are the angles the incident $\hat{\mathbf{v}}_i$ and reflected $\hat{\mathbf{v}}_r$ light ray directions
make with the local surface coordinate frame $(\hat{\mathbf{d}}_x, \hat{\mathbf{d}}_y, \hat{\mathbf{n}})$ shown in Figure 2.15. When modeling
the appearance of an object, as opposed to the appearance of a patch of material, we
need to estimate this function at every point $(x, y)$ on the object’s surface, which gives us the
spatially varying BRDF, or SVBRDF (Weyrich, Lawrence et al. 2009),

$$f_v(x, y, \theta_i, \phi_i, \theta_r, \phi_r; \lambda). \tag{13.7}$$
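To make the argument lists in (13.6) and (13.7) concrete, here is a small sketch of how such functions might be evaluated. The parametric form used (a Lambertian term plus a normalized Blinn-Phong lobe) is a stand-in chosen for brevity, not the model used in the works cited here, and the per-texel parameter layout `params[y][x]` is an assumption. The wavelength $\lambda$ is handled by calling the function once per color channel with that channel's parameters.

```python
import math

def angles_to_direction(theta, phi):
    """Unit vector in the local frame (d_x, d_y, n) for polar angle theta, azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def brdf(theta_i, phi_i, theta_r, phi_r, albedo, k_s, shininess):
    """Stand-in f_r: Lambertian diffuse term plus a normalized Blinn-Phong specular lobe."""
    v_i = angles_to_direction(theta_i, phi_i)
    v_r = angles_to_direction(theta_r, phi_r)
    h = [a + b for a, b in zip(v_i, v_r)]            # half-way vector (unnormalized)
    h_len = math.sqrt(sum(c * c for c in h))
    n_dot_h = max(h[2] / h_len, 0.0) if h_len > 0 else 0.0
    diffuse = albedo / math.pi
    specular = k_s * (shininess + 2) / (2 * math.pi) * n_dot_h ** shininess
    return diffuse + specular

def svbrdf(x, y, theta_i, phi_i, theta_r, phi_r, params):
    """Spatially varying version: look up per-texel parameters, then evaluate f_r."""
    albedo, k_s, shininess = params[y][x]
    return brdf(theta_i, phi_i, theta_r, phi_r, albedo, k_s, shininess)
```

Storing a handful of parameters per surface point, as done here, is one common way to keep the six-dimensional SVBRDF tractable; measured (tabulated) representations are the alternative.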


If sub-surface scattering effects are being modeled, such as the long-range transmission
of light through materials such as alabaster, the eight-dimensional bidirectional scattering-surface
reflectance-distribution function (BSSRDF) is used instead,

$$f_e(x_i, y_i, \theta_i, \phi_i, x_e, y_e, \theta_e, \phi_e; \lambda), \tag{13.8}$$

where the $e$ subscript now represents the emitted rather than the reflected light directions.

Figure 13.32 Image-based reconstruction of appearance and detailed geometry (Lensch,
Kautz et al. 2003) © 2003 ACM. (a) Appearance models (BRDFs) are re-estimated using
divisive clustering. (b) To model detailed spatially varying appearance, each lumitexel is
projected onto the basis formed by the clustered materials.

Weyrich, Lawrence et al. (2009) provide a nice survey of these and related topics, including
basic photometry, BRDF models, traditional BRDF acquisition using gonioreflectometry,
i.e., the precise measurement of visual angles and reflectances (Marschner, Westin et al.
2000; Dupuy and Jakob 2018), multiplexed illumination (Schechner, Nayar, and Belhumeur
2009), skin modeling (Debevec, Hawkins et al. 2000; Weyrich, Matusik et al. 2006), and
image-based acquisition techniques, which simultaneously recover an object’s 3D shape and
reflectance from multiple photographs.

A nice example of this latter approach is the system developed by Lensch, Kautz et al. 
(2003), who estimate locally varying BRDFs and refine their shape models using local esti- 
mates of surface normals. To build up their models, they first associate a lumitexel, which 
contains a 3D position, a surface normal, and a set of sparse radiance samples, with each 
surface point. Next, they cluster such lumitexels into materials that share common proper- 
ties, using a Lafortune reflectance model (Lafortune, Foo et al. 1997) and a divisive cluster- 
ing approach (Figure 13.32a). Finally, to model detailed spatially varying appearance, each 
lumitexel (surface point) is projected onto the basis of clustered appearance models (Fig- 
ure 13.32b). A more accurate system for estimating normals can be obtained using polarized 
lighting, as described by Ma, Hawkins et al. (2007). 
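The final projection step can be sketched as a small linear fit. The assumption here, which may differ from the fitting procedure actually used by Lensch, Kautz et al. (2003), is that each lumitexel's radiance samples are approximated by a weighted sum of the radiances that the clustered basis materials would predict under the same lighting and viewing directions, with the weights found by unconstrained least squares. The function name and array layout are hypothetical.

```python
import numpy as np

def project_lumitexel(basis_predictions, radiance_samples):
    """Least-squares weights of one lumitexel over a basis of clustered materials.

    basis_predictions: (K, M) array; entry [k, m] is the radiance that basis
                       material m predicts for the k-th sample's geometry
    radiance_samples:  (K,) measured radiance values of the lumitexel
    """
    B = np.asarray(basis_predictions, dtype=float)
    r = np.asarray(radiance_samples, dtype=float)
    weights, *_ = np.linalg.lstsq(B, r, rcond=None)
    return weights
```

With only a sparse set of samples per lumitexel, keeping the number of basis materials M small is what makes this fit well-posed.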

More recent approaches to recovering spatially varying BRDFs (SVBRDFs) start with
RGB-D scanners (Park, Newcombe, and Seitz 2018; Schmitt, Donne et al. 2020) or
flash/no-flash image pairs (Aittala, Weyrich, and Lehtinen 2015), or use deep learning approaches to
simultaneously estimate surface normals and appearance models (Li, Sunkavalli, and Chandraker
2018; Li, Xu et al. 2018). Even more sophisticated systems can also estimate shape
and environmental lighting from range scanner sequences (Park, Holynski, and Seitz 2020) or
single monocular images (Boss, Jampani et al. 2020; Li, Shafiei et al. 2020; Chen, Nobuhara,
and Nishino 2020) and even perform relighting on such scenes (Bi, Xu et al. 2020a,b; Sang
and Chandraker 2020; Bi, Xu et al. 2020c). A more in-depth review of techniques for capturing
the 3D shape and appearance of objects with RGB-D cameras can be found in the
state-of-the-art report by Zollhöfer, Stotko et al. (2018).

While most of the techniques discussed in this section require large numbers of views to 
estimate surface properties, an interesting challenge is to take these techniques out of the lab 
and into the real world, and to combine them with regular and internet photo image-based
modeling approaches.


13.7.2 Application: 3D model capture 


The techniques described in this chapter for building complete 3D models from multiple 
images and then recovering their surface appearance have opened up a whole new range of 
applications that often go under the name 3D photography. Pollefeys and Van Gool (2002) 
and Pollefeys, Van Gool et al. (2004) provide nice introductions to such systems, including 
the processing steps of feature matching, structure from motion recovery, dense depth map 
estimation, 3D model building, and texture map recovery. A complete web-based system for 
automatically performing all of these tasks, called ARC3D, is described by Vergauwen and 
Van Gool (2006) and Moons, Van Gool, and Vergauwen (2010). The latter paper provides not 
only an in-depth survey of this whole field but also a detailed description of their complete 
end-to-end system. 

An example of a more recent commercial photogrammetric modeling system that can be 
used for both object and scene capture is Pix4D, whose website shows a wonderful example of 
a 3D texture-mapped castle reconstructed from both regular and aerial drone photographs.²³
Examples of casual 3D photography enabled by the advent of smartphones include Hedman, 
Alsisan et al. (2017), Hedman and Kopf (2018), and Kopf, Matzen et al. (2020) and are 
described in more detail in Section 14.2.2. 

An alternative to such fully automated systems is to put the user in the loop in what is 
sometimes called interactive computer vision. An early example of this was the Façade archi- 
tectural modeling system developed by Debevec, Taylor, and Malik (1996). van den Hengel, 
Dick et al. (2007) describe their VideoTrace system, which performs automated point track- 
ing and 3D structure recovery from video and then lets the user draw triangles and surfaces 
on top of the resulting point cloud, as well as interactively adjusting the locations of model 
vertices. Sinha, Steedly et al. (2008) describe a related system that uses matched vanishing
points in multiple images (Figure 7.50) to infer 3D line orientations and plane normals. These
are then used to guide the user drawing axis-aligned planes, which are automatically fitted to
the recovered 3D point cloud. Fully automated variants on these ideas are described by Zebedin,
Bauer et al. (2008), Furukawa, Curless et al. (2009a), Furukawa, Curless et al. (2009b),
Mičušík and Košecká (2009), and Sinha, Steedly, and Szeliski (2009).

23 https://www.pix4d.com/blog/mapping-chillon-castle-with-drone

As the sophistication and reliability of these techniques continue to improve, we can expect
to see even more user-friendly applications for photorealistic 3D modeling from images
(Exercise 13.8). 


13.8 Additional reading 


Shape from shading is one of the classic problems in computer vision (Horn 1975). Some 
representative papers in this area include those by Horn (1977), Ikeuchi and Horn (1981), 
Pentland (1984), Horn and Brooks (1986), Horn (1990), Szeliski (1991a), Mancini and Wolff 
(1992), Dupuis and Oliensis (1994), and Fua and Leclerc (1995). The collection of papers 
edited by Horn and Brooks (1989) is a great source of information on this topic, especially 
the chapter on variational approaches. The survey by Zhang, Tsai et al. (1999) reviews such 
techniques and also provides some comparative results. 

Woodham (1981) wrote the seminal paper on photometric stereo. Shape from texture techniques
include those by Witkin (1981), Ikeuchi (1981), Blostein and Ahuja (1987), Gårding
(1992), Malik and Rosenholtz (1997), Liu, Collins, and Tsin (2004), Liu, Lin, and Hays 
(2004), Hays, Leordeanu et al. (2006), Lin, Hays et al. (2006), Lobay and Forsyth (2006), 
White and Forsyth (2006), White, Crane, and Forsyth (2007), and Park, Brocklehurst et 
al. (2009). Good papers and books on depth from defocus have been written by Pentland 
(1987), Nayar and Nakagawa (1994), Nayar, Watanabe, and Noguchi (1996), Watanabe and 
Nayar (1998), Chaudhuri and Rajagopalan (1999), and Favaro and Soatto (2006). Additional 
techniques for recovering shape from various kinds of illumination effects, including inter- 
reflections (Nayar, Ikeuchi, and Kanade 1991), are discussed in the book on shape recovery 
edited by Wolff, Shafer, and Healey (1992b). A more recent survey on photometric stereo 
is Ackermann and Goesele (2015) and recent papers include Logothetis, Mecca, and Cipolla
(2019), Haefner, Ye et al. (2019), and Santo, Waechter, and Matsushita (2020). 

Active rangefinding systems, which use laser or natural light illumination projected into 
the scene, have been described by Besl (1989), Rioux and Bird (1993), Kang, Webb et al. 
(1995), Curless and Levoy (1995), Curless and Levoy (1996), Proesmans, Van Gool, and 
Defoort (1998), Bouguet and Perona (1999), Curless (1999), Hebert (2000), Iddan and Yahav
(2001), Goesele, Fuchs, and Seidel (2003), Scharstein and Szeliski (2003), Davis, Ramamoorthi,
and Rusinkiewicz (2003), Zhang, Curless, and Seitz (2003), Zhang, Snavely et
al. (2004), and Moons, Van Gool, and Vergauwen (2010), and in the more recent reviews by
Zhang (2018) and Ikeuchi, Matsushita et al. (2020). Individual range scans can be aligned using
3D correspondence and distance optimization techniques such as iterative closest points
and its variants (Besl and McKay 1992; Zhang 1994; Szeliski and Lavallée 1996; Johnson 
and Kang 1997; Gold, Rangarajan et al. 1998; Johnson and Hebert 1999; Pulli 1999; David, 
DeMenthon et al. 2004; Li and Hartley 2007; Enqvist, Josephson, and Kahl 2009; Pomer- 
leau, Colas, and Siegwart 2015; Rusinkiewicz 2019). Once they have been aligned, range 
scans can be merged using techniques that model the signed distance of surfaces to volumet- 
ric sample points (Hoppe, DeRose et al. 1992; Curless and Levoy 1996; Hilton, Stoddart et 
al. 1996; Wheeler, Sato, and Ikeuchi 1998; Kazhdan, Bolitho, and Hoppe 2006; Lempitsky 
and Boykov 2007; Zach, Pock, and Bischof 2007b; Zach 2008; Newcombe, Izadi et al. 2011; 
Zhou, Miller, and Koltun 2013; Newcombe, Fox, and Seitz 2015; Zollhöfer, Stotko et al.
2018). 


Once constructed, 3D surfaces can be modeled and manipulated using a variety of three- 
dimensional representations, which include triangle meshes (Eck, DeRose et al. 1995; Hoppe 
1996), splines (Farin 1992; Lee, Wolberg, and Shin 1997; Farin 2002), subdivision sur- 
faces (Stollnitz, DeRose, and Salesin 1996; Zorin, Schröder, and Sweldens 1996; Warren and 
Weimer 2001; Peters and Reif 2008), and geometry images (Gu, Gortler, and Hoppe 2002). 
Alternatively, they can be represented as collections of point samples with local orientation 
estimates (Hoppe, DeRose et al. 1992; Szeliski and Tonnesen 1992; Turk and O’Brien 2002;
Pfister, Zwicker et al. 2000; Alexa, Behr et al. 2003; Pauly, Keiser et al. 2003; Diebel, Thrun,
and Brünig 2006; Guennebaud and Gross 2007; Guennebaud, Germann, and Gross 2008;
Öztireli, Guennebaud, and Gross 2008; Berger, Tagliasacchi et al. 2017). They can also be
modeled using implicit inside–outside characteristic or signed distance functions sampled
on regular or irregular (octree) volumetric grids (Lavallée and Szeliski 1995; Szeliski and 
Lavallée 1996; Frisken, Perry et al. 2000; Dinh, Turk, and Slabaugh 2002; Kazhdan, Bolitho, 
and Hoppe 2006; Lempitsky and Boykov 2007; Zach, Pock, and Bischof 2007b; Zach 2008; 
Kazhdan and Hoppe 2013). 


The literature on model-based 3D reconstruction is extensive. For modeling architecture 
and urban scenes, both interactive and fully automated systems have been developed. A 
special journal issue devoted to the reconstruction of large-scale 3D scenes (Zhu and Kanade 
2008) is a good source of references and Robertson and Cipolla (2009) give a nice description
of a complete system. Lots of additional references can be found in Section 13.6.1.


Face and whole body modeling and tracking is a very active sub-field of computer vision,
with its own conferences and workshops, e.g., the International Conference on Automatic
Face and Gesture Recognition (FG) and IEEE Workshop on Analysis and Modeling of
Faces and Gestures (AMFG). Two recent survey papers on 3D face modeling and tracking
are Zollhöfer, Thies et al. (2018) and Egger, Smith et al. (2020), while surveys on the topic of
whole body modeling and tracking include Forsyth, Arikan et al. (2006), Moeslund, Hilton,
and Krüger (2006), and Sigal, Balan, and Black (2010).

Some representative papers on recovering texture maps from multiple color and RGB-D
images include Gal, Wexler et al. (2010), Waechter, Moehrle, and Goesele (2014), Zhou and
Koltun (2014), and Lee, Ha et al. (2020), as well as Zollhöfer, Stotko et al. (2018, Section 4.1).
The more complex process of recovering spatially varying BRDFs is covered in surveys by
Dorsey, Rushmeier, and Sillion (2007) and Weyrich, Lawrence et al. (2009). More recent
techniques that can do this using fewer images and RGB-D images include Aittala, Weyrich,
and Lehtinen (2015), Li, Sunkavalli, and Chandraker (2018), Schmitt, Donne et al. (2020),
and Boss, Jampani et al. (2020), as well as the survey by Zollhöfer, Stotko et al. (2018).


13.9 Exercises 


Ex 13.1: Shape from focus. Grab a series of focused images with a digital SLR set to manual
focus (or get one that allows for programmatic focus control) and recover the depth of an
object.

1. Take some calibration images, e.g., of a checkerboard, so that you can compute a mapping
between the amount of defocus and the focus setting.

2. Try both a fronto-parallel planar target and one which is slanted so that it covers the
working range of the sensor. Which one works better?

3. Now put a real object in the scene and perform a similar focus sweep.

4. For each pixel, compute the local sharpness and fit a parabolic curve over focus settings
to find the most in-focus setting.

5. Map these focus settings to depth and compare your result to ground truth. If you are
using a known simple object, such as a sphere or cylinder (a ball or a soda can), it’s
easy to measure its true shape.

6. (Optional) See if you can recover the depth map from just two or three focus settings.

7. (Optional) Use an LCD projector to project artificial texture onto the scene. Use a pair
of cameras to compare the accuracy of your shape from focus and shape from stereo
techniques.

8. (Optional) Create an all-in-focus image using the technique of Agarwala, Dontcheva et
al. (2004).
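As a starting point for step 4, the following sketch shows one possible sharpness measure and the parabolic peak refinement. Both choices are assumptions rather than prescribed by the exercise: sharpness is taken to be the squared discrete Laplacian at a pixel, and the focus settings are assumed to be equally spaced so that the sub-sample peak can be found from three neighboring values.

```python
def local_sharpness(img, x, y):
    """Squared 4-neighbor Laplacian at interior pixel (x, y) of a 2D list of intensities."""
    lap = (4 * img[y][x] - img[y - 1][x] - img[y + 1][x]
           - img[y][x - 1] - img[y][x + 1])
    return lap * lap

def refine_focus_peak(sharpness):
    """Fractional index of the sharpest focus setting for one pixel.

    sharpness: list of sharpness values, one per (equally spaced) focus setting.
    A parabola is fitted through the maximum and its two neighbors.
    """
    k = max(range(len(sharpness)), key=sharpness.__getitem__)
    if k == 0 or k == len(sharpness) - 1:
        return float(k)                      # peak at the end of the sweep: no refinement
    s0, s1, s2 = sharpness[k - 1], sharpness[k], sharpness[k + 1]
    denom = s0 - 2 * s1 + s2
    if denom == 0:
        return float(k)
    return k + 0.5 * (s0 - s2) / denom
```

The fractional index returned by `refine_focus_peak` would then be converted to depth using the calibration from step 1.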


Ex 13.2: Shadow striping. Implement the handheld shadow striping system of Bouguet 
and Perona (1999). The basic steps include the following: 


1. Set up two background planes behind the object of interest and calculate their orienta- 
tion relative to the viewer, e.g., with fiducial marks. 


2. Cast a moving shadow with a stick across the scene; record the video or capture the 
data with a webcam. 


3. Estimate each light plane equation from the projections of the cast shadow against the 
two backgrounds. 


4. Triangulate to the remaining points on each curve to get a 3D stripe and display the 
stripes using a 3D graphics engine. 


5. (Optional) remove the requirement for a known second (vertical) plane and infer its 
location (or that of the light source) using the techniques described by Bouguet and 
Perona (1999). The techniques from Exercise 10.9 may also be helpful here. 
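Steps 3 and 4 reduce to two small geometric operations, sketched below under simplifying assumptions: the camera center is at the origin, the 3D points where the shadow edge crosses the two background planes are already known, and each pixel has been converted to a viewing ray direction using the camera calibration. The function names are hypothetical.

```python
def plane_from_points(p0, p1, p2):
    """Plane n . X = d through three non-collinear 3D points."""
    u = [b - a for a, b in zip(p0, p1)]
    v = [b - a for a, b in zip(p0, p2)]
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])          # cross product u x v
    d = sum(a * b for a, b in zip(n, p0))
    return n, d

def intersect_ray_with_plane(ray_dir, n, d):
    """3D point where the ray t * ray_dir (from the origin) meets the plane n . X = d."""
    denom = sum(a * b for a, b in zip(n, ray_dir))
    if abs(denom) < 1e-12:
        raise ValueError("ray is parallel to the light plane")
    t = d / denom
    return tuple(t * c for c in ray_dir)
```

For each video frame, the light plane is estimated once, and every pixel on the shadow edge that falls on the object is then triangulated with `intersect_ray_with_plane`.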


Ex 13.3: Range data registration. Register two or more 3D datasets using either iterative 
closest points (ICP) (Besl and McKay 1992; Zhang 1994; Gold, Rangarajan et al. 1998) or 
octree signed distance fields (Szeliski and Lavallée 1996) (Section 13.2.1). 

Apply your technique to narrow-baseline stereo pairs, e.g., obtained by moving a camera
around an object, using structure from motion to recover the camera poses, and using a
standard stereo matching algorithm.
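For the ICP option, a bare-bones sketch might look as follows. It assumes point-to-point distances, brute-force nearest-neighbor search, and a closed-form rigid alignment via the SVD at each iteration; robust weighting, outlier rejection, and spatial acceleration structures, which practical implementations need, are left out. The function names are illustrative.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t with dst ~ src @ R.T + t for corresponding (N, 3) point sets."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = c_dst - R @ c_src
    return R, t

def icp(src, dst, iterations=20):
    """Align src to dst by alternating closest-point matching and rigid fitting."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iterations):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)  # all pairwise squared distances
        nearest = dst[d2.argmin(axis=1)]
        R, t = best_rigid_transform(cur, nearest)
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t   # accumulate the composite motion
    return R_total, t_total
```

Because ICP only converges locally, the camera poses recovered by structure from motion are a natural source for the initial alignment.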


Ex 13.4: Range data merging. Merge the datasets that you registered in the previous exercise
using signed distance fields (Curless and Levoy 1996; Hilton, Stoddart et al. 1996)
or one of their newer variants (Newcombe, Izadi et al. 2011; Hornung, Wurm et al. 2013;
Nießner, Zollhöfer et al. 2013; Klingensmith, Dryanovski et al. 2015; Dai, Nießner et al.
2017; Zollhöfer, Stotko et al. 2018). Extract a meshed surface model from the signed distance
field using marching cubes and display the resulting model.
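The core of the merging step can be sketched as a weighted average of truncated signed distances, in the spirit of the volumetric approaches cited above. The sketch assumes that each registered scan has already been converted to a signed distance value and a confidence weight at every voxel of a shared grid; the function name and the truncation handling are illustrative assumptions.

```python
import numpy as np

def fuse_signed_distances(distances, weights, truncation):
    """Weighted average of truncated signed distances on a common voxel grid.

    distances, weights: sequences of equally shaped arrays, one pair per scan
    truncation:         distances are clipped to [-truncation, truncation]
    Returns the fused distance field (NaN where no scan contributes) and the
    accumulated weights.
    """
    D = np.zeros(np.shape(distances[0]), dtype=float)
    W = np.zeros_like(D)
    for d, w in zip(distances, weights):
        d = np.clip(np.asarray(d, dtype=float), -truncation, truncation)
        w = np.asarray(w, dtype=float)
        D += w * d
        W += w
    fused = np.full_like(D, np.nan)
    valid = W > 0
    fused[valid] = D[valid] / W[valid]
    return fused, W
```

The zero level set of the fused field is the merged surface. It can be extracted with any marching cubes implementation, for example `skimage.measure.marching_cubes` in scikit-image, after deciding how to treat the unobserved (NaN) voxels.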


Ex 13.5: Surface simplification. Use progressive meshes (Hoppe 1996) or some other technique
from Section 13.3.2 to create a hierarchical simplification of your surface model.


Ex 13.6: Architectural modeler. Build a 3D interior or exterior model of some architectural
structure, such as your house, from a series of handheld wide-angle photographs.

1. Extract lines and vanishing points (Exercises 7.11–7.14) to estimate the dominant
directions in each image.

2. Use structure from motion to recover all of the camera poses and match up the vanishing
points.

3. Let the user sketch the locations of the walls by drawing lines corresponding to wall
bottoms, tops, and horizontal extents onto the images (Sinha, Steedly et al. 2008); see
also Exercise 11.4. Do something similar for openings (doors and windows) and
simple furniture (tables and countertops).

4. Convert the resulting polygonal meshes into a 3D model (e.g., VRML) and optionally
texture-map these surfaces from the images.


Ex 13.7: Body tracker. Download some human body movement sequences from one of
the datasets such as HumanEva, MPI FAUST, or AMASS discussed in Section 13.6.4. Either
implement a human motion tracker from scratch or extend existing code in some interesting
way.


Ex 13.8: 3D photography. Combine all of your previously developed techniques to pro- 
duce a system that takes a series of photographs or a video and constructs a photorealistic 
texture-mapped 3D model.

Figure 14.1 Image-based and video-based rendering: (a) a 3D view of a Photo Tourism
reconstruction (Snavely, Seitz, and Szeliski 2006) © 2006 ACM; (b) a slice through a 4D light
field (Gortler, Grzeszczuk et al. 1996) © 1996 ACM; (c) sprites with depth (Shade, Gortler
et al. 1998) © 1998 ACM; (d) surface light field (Wood, Azuma et al. 2000) © 2000 ACM;
(e) environment matte in front of a novel background (Zongker, Werner et al. 1999) © 1999
ACM; (f) video view interpolation (Zitnick, Kang et al. 2004) © 2004 ACM; (g) Video Rewrite
used to re-animate old video (Bregler, Covell, and Slaney 1997) © 1997 ACM; (h) video
texture of a candle flame (Schödl, Szeliski et al. 2000) © 2000 ACM; (i) hyperlapse video,
stitching multiple frames with 3D proxies (Kopf, Cohen, and Szeliski 2014) © 2014 ACM.

Over the last few decades, image-based rendering has emerged as one of the most exciting
applications of computer vision (Kang, Li et al. 2006; Shum, Chan, and Kang 2007; Gallo,
Troccoli et al. 2020). In image-based rendering, 3D reconstruction techniques from computer 
vision are combined with computer graphics rendering techniques that use multiple views of 
a scene to create interactive photo-realistic experiences such as the Photo Tourism system 
shown in Figure 14.1a. Commercial versions of such systems include immersive street-level 
navigation in online mapping systems such as Google Maps and the creation of 3D Photo- 
synths from large collections of casually acquired photographs. 

In this chapter, we explore a variety of image-based rendering techniques, such as those 
illustrated in Figure 14.1. We begin with view interpolation (Section 14.1), which creates a 
seamless transition between a pair of reference images using one or more precomputed depth 
maps. Closely related to this idea are view-dependent texture maps (Section 14.1.1), which 
blend multiple texture maps on a 3D model’s surface. The representations used for both the 
color imagery and the 3D geometry in view interpolation include a number of clever variants 
such as layered depth images (Section 14.2) and sprites with depth (Section 14.2.1). 

We continue our exploration of image-based rendering with the light field and Lumigraph 
four-dimensional representations of a scene’s appearance (Section 14.3), which can be used 
to render the scene from any arbitrary viewpoint. Variants on these representations include 
the unstructured Lumigraph (Section 14.3.1), surface light fields (Section 14.3.2), concentric 
mosaics (Section 14.3.3), and environment mattes (Section 14.4). 

We then explore the topic of video-based rendering, which uses one or more videos to
create novel video-based experiences (Section 14.5). The topics we cover include video- 
based facial animation (Section 14.5.1), as well as video textures (Section 14.5.2), in which 
short video clips can be seamlessly looped to create dynamic real-time video-based render- 
ings of a scene. 

We continue with a discussion of 3D videos created from multiple video streams (Sec- 
tion 14.5.4), as well as video-based walkthroughs of environments (Section 14.5.5), which 
have found widespread application in immersive outdoor mapping and driving direction sys- 
tems. We finish this chapter with a review of recent work in neural rendering (Section 14.6), 
where generative neural networks are used to create more realistic reconstructions of both
static scenes and objects as well as people.


14.1 View interpolation 


While the term image-based rendering first appeared in the papers by Chen (1995) and
McMillan and Bishop (1995), the work on view interpolation by Chen and Williams (1993)
is considered the seminal paper in the field. In view interpolation, pairs of rendered images
are combined with their precomputed depth maps to generate interpolated views that mimic
what a virtual camera would see in between the two reference views. Since its original
introduction, the whole field of novel view synthesis from captured images has continued to be
a very active area. A good historical overview and recent results can be found in the CVPR
tutorial on this topic (Gallo, Troccoli et al. 2020).

Figure 14.2 View interpolation (Chen and Williams 1993) © 1993 ACM: (a) holes from
one source image (shown in blue); (b) holes after combining two widely spaced images; (c)
holes after combining two closely spaced images; (d) after interpolation (hole filling).

View interpolation combines two ideas that were previously used in computer vision and 
computer graphics. The first is the idea of pairing a recovered depth map with the refer- 
ence image used in its computation and then using the resulting texture-mapped 3D model 
to generate novel views (Figure 12.1). The second is the idea of morphing (Section 3.6.3) 
(Figure 3.51), where correspondences between pairs of images are used to warp each refer- 
ence image to an in-between location while simultaneously cross-dissolving between the two 
warped images. 

Figure 14.2 illustrates this process in more detail. First, both source images are warped 
to the novel view, using both the knowledge of the reference and virtual 3D camera pose 
along with each image’s depth map (2.68-2.70). In the paper by Chen and Williams (1993), 
a forward warping algorithm (Algorithm 3.1 and Figure 3.45) is used. The depth maps are 
represented as quadtrees for both space and rendering time efficiency (Samet 1989). 

During the forward warping process, multiple pixels (which occlude one another) may 
land on the same destination pixel. To resolve this conflict, either a z-buffer depth value can 
be associated with each destination pixel or the images can be warped in back-to-front order, 
which can be computed based on the knowledge of epipolar geometry (Chen and Williams 
1993; Laveau and Faugeras 1994; McMillan and Bishop 1995). 
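
The z-buffer variant of this conflict resolution can be sketched as follows. This is only a
minimal illustration, not the quadtree-based implementation of Chen and Williams (1993): the
function name `forward_warp_zbuffer`, the pinhole intrinsics `K` shared by both cameras, and
the rigid motion `(R, t)` from the reference to the virtual camera are assumptions introduced
here, and each source pixel is splatted to a single nearest destination pixel.

```python
import numpy as np

def forward_warp_zbuffer(color, depth, K, R, t):
    """Forward-warp a reference image to a virtual view, keeping the nearest surface.

    color: (H, W, C) array; depth: (H, W) z-depth in the reference camera;
    K: 3x3 intrinsics (assumed shared by both views); R, t: reference-to-virtual motion.
    Returns the warped image, its z-buffer, and a boolean hole mask.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)]).astype(float)  # 3 x N
    # Back-project every pixel to 3D, then move it into the virtual camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    pts_v = R @ pts + np.asarray(t, float).reshape(3, 1)
    proj = K @ pts_v
    z = pts_v[2]
    front = z > 1e-9                       # discard points behind the virtual camera
    u = np.zeros(H * W, dtype=int)
    v = np.zeros(H * W, dtype=int)
    u[front] = np.rint(proj[0, front] / proj[2, front]).astype(int)
    v[front] = np.rint(proj[1, front] / proj[2, front]).astype(int)
    inside = front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    out = np.zeros_like(color, dtype=float)
    zbuf = np.full((H, W), np.inf)
    src = color.reshape(-1, color.shape[-1])
    for i in np.flatnonzero(inside):
        # Several source pixels may land here; the closest one occludes the others.
        if z[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = z[i]
            out[v[i], u[i]] = src[i]
    return out, zbuf, np.isinf(zbuf)
```

Destination pixels that receive no source pixel keep an infinite z-buffer value, which is how
the hole mask is obtained. Warping in back-to-front order would make the per-pixel depth
comparison unnecessary, at the cost of computing a valid traversal order first.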


Once the two reference images have been warped to the novel view (Figure 14.2a–b), they
can be merged to create a coherent composite (Figure 14.2c). Whenever one of the images
has a hole (illustrated as a cyan pixel), the other image is used as the final value. When both
images have pixels to contribute, these can be blended as in usual morphing, i.e., according
to the relative distances between the virtual and source cameras. Note that if the two images
have very different exposures, which can happen when performing view interpolation on real
images, the hole-filled regions and the blended regions will have different exposures, leading
to subtle artifacts.
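
The merging rule described above can be written compactly. The following sketch assumes the
two images have already been warped to the novel view and come with boolean hole masks; the
name `merge_warped_views` and the scalar `w` (the normalized distance of the virtual camera
from the first source camera, between 0 and 1) are illustrative choices, not notation from the
original paper.

```python
import numpy as np

def merge_warped_views(img0, hole0, img1, hole1, w):
    """Merge two warped views: blend where both are valid, copy where only one is.

    img0, img1: (H, W, C) warped images; hole0, hole1: (H, W) boolean hole masks;
    w: 0 at source camera 0, 1 at source camera 1 (assumed linear in camera distance).
    Returns the composite and a mask of pixels that are holes in both inputs.
    """
    v0, v1 = ~hole0, ~hole1
    both = v0 & v1
    # Morphing-style weights where both views contribute, otherwise all-or-nothing.
    w0 = np.where(both, 1.0 - w, v0.astype(float))
    w1 = np.where(both, w, v1.astype(float))
    out = w0[..., None] * img0 + w1[..., None] * img1
    return out, ~(v0 | v1)
```

Because the weights switch abruptly from a blend to a single image at hole boundaries, this
sketch exhibits exactly the exposure artifact mentioned above when the two sources differ in
brightness.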

The final step in view interpolation (Figure 14.2d) is to fill any remaining holes or cracks
due to the forward warping process or lack of source data (scene visibility). This can be done
by copying colors from the more distant (background) pixels adjacent to the hole. (Otherwise,
foreground objects are subject to a “fattening effect”.)
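
One simple way to realize this rule is a scanline fill, sketched below. It is a deliberate
simplification (holes are examined one row at a time and only their left and right neighbors
are considered), and the function name `fill_holes_from_background` is hypothetical.

```python
import numpy as np

def fill_holes_from_background(color, depth, hole):
    """Fill each horizontal run of hole pixels from its farther bordering pixel.

    color: (H, W, C); depth: (H, W); hole: (H, W) boolean mask.
    Returns filled copies of the color and depth images.
    """
    out, z = color.copy(), depth.copy()
    H, W = hole.shape
    for y in range(H):
        x = 0
        while x < W:
            if not hole[y, x]:
                x += 1
                continue
            start = x
            while x < W and hole[y, x]:
                x += 1
            # Valid pixels bordering the run [start, x): start - 1 and x, if they exist.
            cands = [c for c in (start - 1, x) if 0 <= c < W]
            if not cands:
                continue                     # the whole row is a hole
            src = max(cands, key=lambda c: depth[y, c])   # farther pixel wins
            out[y, start:x] = color[y, src]
            z[y, start:x] = depth[y, src]
    return out, z
```

Choosing the nearer neighbor instead would extend foreground colors into the disoccluded
region, which is the fattening effect noted above.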

The above process works well for rigid scenes, although its visual quality (lack of
aliasing) can be improved using a two-pass, forward-backward algorithm (Section 14.2.1) (Shade,
Gortler et al. 1998) or full 3D rendering (Zitnick, Kang et al. 2004). In the case where the
two reference images are views of a non-rigid scene, e.g., a person smiling in one image and 
frowning in the other, view morphing, which combines ideas from view interpolation with 
regular morphing, can be used (Seitz and Dyer 1996). A depth map fitted to a face can also 
be used to synthesize a view from a longer distance, removing the enlarged nose and other 
facial features common to “selfie” photography (Fried, Shechtman et al. 2016). 

While the original view interpolation paper describes how to generate novel views based
on similar precomputed (linear perspective) images, the plenoptic modeling paper of
McMillan and Bishop (1995) argues that cylindrical images should be used to store the
precomputed rendering or real-world images. Chen (1995) also proposes using environment maps
(cylindrical, cubic, or spherical) as source images for view interpolation.


14.1.1 View-dependent texture maps 


View-dependent texture maps (Debevec, Taylor, and Malik 1996) are closely related to view
interpolation. Instead of associating a separate depth map with each input image, a single 3D
model is created for the scene, but different images are used as texture map sources depending
on the virtual camera’s current position (Figure 14.3a).¹

¹ The term image-based modeling, which is now commonly used to describe the creation of
texture-mapped 3D models from multiple images, appears to have first been used by Debevec,
Taylor, and Malik (1996), who also used the term photogrammetric modeling to describe the
same process.

Figure 14.3 View-dependent texture mapping (Debevec, Taylor, and Malik 1996) © 1996
ACM. (a) The weighting given to each input view depends on the relative angles between the
novel (virtual) view and the original views; (b) simplified 3D model geometry; (c) with
view-dependent texture mapping, the geometry appears to have more detail (recessed windows).

In more detail, given a new virtual camera position, the similarity of this camera’s view
of each polygon (or pixel) is compared to that of potential source images. The images are
then blended using a weighting that is inversely proportional to the angles αᵢ between the
virtual view and the source views (Figure 14.3a).² Even though the geometric model can be
fairly coarse (Figure 14.3b), blending different views gives a strong sense of more detailed
geometry because of the visual motion between corresponding pixels. While the original
paper performs the weighted blend computation separately at each pixel or coarsened polygon
face, follow-on work by Debevec, Yu, and Borshukov (1998) presents a more efficient
implementation based on precomputing contributions for various portions of viewing space and
then using projective texture mapping (OpenGL-ARB 1997).
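
As a rough illustration of angle-based weighting, the sketch below computes normalized blend
weights for one surface point. The function name `view_dependent_weights`, the small constant
`eps` that avoids division by zero, and the normalization to unit sum are assumptions made
here; the exact weighting and visibility handling in the original system may differ.

```python
import numpy as np

def view_dependent_weights(point, virtual_center, source_centers, eps=1e-6):
    """Blend weights inversely proportional to the angle between viewing rays.

    point: 3D surface point; virtual_center: center of the novel camera;
    source_centers: iterable of source camera centers. Returns weights summing to 1.
    """
    def unit(vec):
        return vec / np.linalg.norm(vec)

    d_virtual = unit(np.asarray(point, float) - np.asarray(virtual_center, float))
    weights = []
    for c in source_centers:
        d_source = unit(np.asarray(point, float) - np.asarray(c, float))
        # Angle between the virtual ray and this source ray to the same surface point.
        alpha = np.arccos(np.clip(d_source @ d_virtual, -1.0, 1.0))
        weights.append(1.0 / (alpha + eps))
    weights = np.array(weights)
    return weights / weights.sum()
```

When the virtual camera coincides with one of the source cameras, that source receives
essentially all of the weight, so the rendering reproduces the original photograph there.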

The idea of view-dependent texture mapping has been used in a large number of sub- 
sequent image-based rendering systems, including facial modeling and animation (Pighin, 
Hecker et al. 1998) and 3D scanning and visualization (Pulli, Abi-Rached et al. 1998). 
Closely related to view-dependent texture mapping is the idea of blending between light rays 
in 4D space, which forms the basis of the Lumigraph and unstructured Lumigraph systems 
(Section 14.3) (Gortler, Grzeszczuk et al. 1996; Buehler, Bosse et al. 2001). 

To provide even more realism in their Façade system, Debevec, Taylor, and Malik (1996)
also include a model-based stereo component, which computes an offset (parallax) map for
each coarse planar facet of their model. They call the resulting analysis and rendering system
a hybrid geometry- and image-based approach, as it uses traditional 3D geometric modeling
to create the global 3D model, but then uses local depth offsets, along with view
interpolation, to add visual realism. Instead of warping per-pixel depth maps or coarser
triangulated geometry (as in unstructured Lumigraphs, Section 14.3.1), it is also possible to
use super-pixels as the basic primitives being warped (Chaurasia, Duchene et al. 2013). Fixed
rules for view-dependent blending can also be replaced with deep neural networks, as in the deep
blending system by Hedman, Philip et al. (2018).


² More sophisticated blending weights are discussed in Section 14.3.1 on unstructured Lumigraph rendering.
Figure 14.4 Photo Tourism (Snavely, Seitz, and Szeliski 2006) © 2006 ACM: (a) a 3D
overview of the scene, with translucent washes and lines painted onto the planar impostors;
(b) once the user has selected a region of interest, a set of related thumbnails is displayed
along the bottom; (c) planar proxy selection for optimal stabilization (Snavely, Garg et al.
2008) © 2008 ACM.


14.1.2 Application: Photo Tourism 


While view interpolation was originally developed to accelerate the rendering of 3D scenes 
on low-powered processors and systems without graphics acceleration, it turns out that it
can be applied directly to large collections of casually acquired photographs. The Photo 
Tourism system developed by Snavely, Seitz, and Szeliski (2006) uses structure from motion 
to compute the 3D locations and poses of all the cameras taking the images, along with a 
sparse 3D point-cloud model of the scene (Section 11.4.6, Figure 11.17). 

To perform an image-based exploration of the resulting sea of images (Aliaga, Funkhouser
et al. 2003), Photo Tourism first associates a 3D proxy with each image. While a triangulated 
mesh obtained from the point cloud can sometimes form a suitable proxy, e.g., for outdoor ter- 
rain models, a simple dominant plane fit to the 3D points visible in each image often performs 
better, because it does not contain any erroneous segments or connections that pop out as ar- 
tifacts. As automated 3D modeling techniques continue to improve, however, the pendulum 
may swing back to more detailed 3D geometry (Goesele, Snavely et al. 2007; Sinha, Steedly, 
and Szeliski 2009). One example is the hybrid rendering system developed by Goesele, Ack- 
ermann et al. (2010), who use dense per-image depth maps for the well-reconstructed portions 
of each image and 3D colored point clouds for the less confident regions. 

The resulting image-based navigation system lets users move from photo to photo, ei- 
ther by selecting cameras from a top-down view of the scene (Figure 14.4a) or by selecting 
regions of interest in an image, navigating to nearby views, or selecting related thumbnails 
(Figure 14.4b). To create a background for the 3D scene, e.g., when being viewed from
above, non-photorealistic techniques (Section 10.5.2), such as translucent color washes or
highlighted 3D line segments, can be used (Figure 14.4a). The system can also be used to
annotate regions of images and to automatically propagate such annotations to other
photographs.

The 3D planar proxies used in Photo Tourism and the related Photosynth system from 
Microsoft result in non-photorealistic transitions reminiscent of visual effects such as “page 
flips”. Selecting a stable 3D axis for all the planes can reduce the amount of swimming and 
enhance the perception of 3D (Figure 14.4c) (Snavely, Garg et al. 2008). It is also possi- 
ble to automatically detect objects in the scene that are seen from multiple views and create 
“orbits” of viewpoints around such objects. Furthermore, nearby images in both 3D posi- 
tion and viewing direction can be linked to create “virtual paths”, which can then be used 
to navigate between arbitrary pairs of images, such as those you might take yourself while 
walking around a popular tourist site (Snavely, Garg et al. 2008). This idea has been fur- 
ther developed and released as a feature on Google Maps called Photo Tours (Kushal, Self 
et al. 2012).³ The quality of such synthesized virtual views has become so high that
Shan, Adams et al. (2013) propose a visual Turing test to distinguish between synthetic and
real images. Waechter, Beljan et al. (2017) produce higher-resolution quality assessments
of image-based modeling and rendering systems using what they call virtual rephotography.
Further improvements can be obtained using even more recent neural rendering techniques 
(Hedman, Philip et al. 2018; Meshry, Goldman et al. 2019; Li, Xian et al. 2020), which we 
discuss in Section 14.6. 

The spatial matching of image features and regions performed by Photo Tourism can 
also be used to infer more information from large image collections. For example, Simon, 
Snavely, and Seitz (2007) show how the match graph between images of popular tourist sites 
can be used to find the most iconic (commonly photographed) objects in the collection, along 
with their related tags. In follow-on work, Simon and Seitz (2008) show how such tags can be 
propagated to sub-regions of each image, using an analysis of which 3D points appear in the 
central portions of photographs. Extensions of these techniques to all of the world’s images, 
including the use of GPS tags where available, have been investigated as well (Li, Wu et al. 
2008; Quack, Leibe, and Van Gool 2008; Crandall, Backstrom et al. 2009; Li, Crandall, and 
Huttenlocher 2009; Zheng, Zhao et al. 2009; Raguram, Wu et al. 2011). 


14.2 Layered depth images 


³ https://maps.googleblog.com/2012/04/visit-global-landmarks-with-photo-tours.html

Figure 14.5 A variety of image-based rendering primitives, which can be used depending
on the distance between the camera and the object of interest (Shade, Gortler et al. 1998)
© 1998 ACM. Closer objects may require more detailed polygonal representations, while
mid-level objects can use a layered depth image (LDI), and far-away objects can use sprites
(potentially with depth) and environment maps.

Traditional view interpolation techniques associate a single depth map with each source or
reference image. Unfortunately, when such a depth map is warped to a novel view, holes and
cracks inevitably appear behind the foreground objects. One way to alleviate this problem is
to keep several depth and color values (depth pixels) at every pixel in a reference image (or
at least for pixels near foreground-background transitions) (Figure 14.5). The resulting data
structure, which is called a layered depth image (LDI), can be used to render new views using
a back-to-front forward warping (splatting) algorithm (Shade, Gortler et al. 1998).
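
A layered depth image can be thought of as a 2D array whose cells hold short lists of depth
pixels. The sketch below shows only this storage idea; the class and method names
(`DepthPixel`, `LayeredDepthImage`, `insert`, `back_to_front`) and the duplicate tolerance
`tol` are hypothetical, and the per-pixel sort stands in for the epipolar ordering a real
splatting renderer would use.

```python
from dataclasses import dataclass, field

@dataclass
class DepthPixel:
    color: tuple      # e.g., (r, g, b)
    depth: float      # distance along the reference camera's viewing ray

@dataclass
class LayeredDepthImage:
    width: int
    height: int
    pixels: list = field(init=False)

    def __post_init__(self):
        # One (initially empty) list of depth pixels per reference-image pixel.
        self.pixels = [[[] for _ in range(self.width)] for _ in range(self.height)]

    def insert(self, x, y, color, depth, tol=1e-3):
        """Add a sample unless a surface at (nearly) the same depth is already stored."""
        samples = self.pixels[y][x]
        if any(abs(s.depth - depth) < tol for s in samples):
            return
        samples.append(DepthPixel(color, depth))
        samples.sort(key=lambda s: s.depth)          # kept front to back

    def back_to_front(self, x, y):
        """Samples along this pixel's ray, farthest first, ready for splatting."""
        return list(reversed(self.pixels[y][x]))
```

Most pixels end up with a single sample, so the storage cost grows mainly near
foreground-background transitions, where the extra samples are what fill the disocclusions.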


14.2.1 Impostors, sprites, and layers 


An alternative to keeping lists of color-depth values at each pixel, as is done in the LDI, is
to organize objects into different layers or sprites. The term sprite originates in the computer
game industry, where it is used to designate flat animated characters in games such as
Pac-Man or Mario Bros. When put into a 3D setting, such objects are often called impostors,
because they use a piece of flat, alpha-matted geometry to represent simplified versions of
3D objects that are far away from the camera (Shade, Lischinski et al. 1996; Lengyel and
Snyder 1997; Torborg and Kajiya 1996). In computer vision, such representations are usually
called layers (Wang and Adelson 1994; Baker, Szeliski, and Anandan 1998; Torr, Szeliski,
and Anandan 1999; Birchfield, Natarajan, and Tomasi 2007). Section 9.4.2 discusses the
topics of transparent layers and reflections, which occur on specular and transparent surfaces
such as glass.

Figure 14.6 Sprites with depth (Shade, Gortler et al. 1998) © 1998 ACM: (a) alpha-matted
color sprite; (b) corresponding relative depth or parallax; (c) rendering without relative
depth; (d) rendering with depth (note the curved object boundaries).


While flat layers can often serve as an adequate representation of geometry and appear- 
ance for far-away objects, better geometric fidelity can be achieved by also modeling the 
per-pixel offsets relative to a base plane, as shown in Figures 14.5 and 14.6a–b. Such
representations are called plane plus parallax in the computer vision literature (Kumar, Anandan,
and Hanna 1994; Sawhney 1994; Szeliski and Coughlan 1997; Baker, Szeliski, and Anandan 
1998), as discussed in Section 9.4 (Figure 9.14). In addition to fully automated stereo tech- 
niques, it is also possible to paint in depth layers (Kang 1998; Oh, Chen et al. 2001; Shum, 
Sun et al. 2004) or to infer their 3D structure from monocular image cues (Sections 6.4.4 and 
12.8) (Hoiem, Efros, and Hebert 2005b; Saxena, Sun, and Ng 2009). 


How can we render a sprite with depth from a novel viewpoint? One possibility, as with 
a regular depth map, is to just forward warp each pixel to its new location, which can cause 
aliasing and cracks. A better way, which we have already mentioned in Section 3.6.2, is to 
first warp the depth (or (u, v) displacement) map to the novel view, fill in the cracks, and then 
use higher-quality inverse warping to resample the color image (Shade, Gortler et al. 1998). 
Figure 14.6d shows the results of applying such a two-pass rendering algorithm. From this 
still image, you can appreciate that the foreground sprites look more rounded; however, to 
fully appreciate the improvement in realism, you would have to look at the actual animated 
sequence. 
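
The two-pass idea can be illustrated on a single scanline. The sketch below is a simplified
1D version under stated assumptions (purely horizontal displacements given in pixels, larger
displacement meaning closer, a grayscale scanline, and clamping at the image border); the
function name `two_pass_render_row` is hypothetical and this is not the implementation of
Shade, Gortler et al. (1998).

```python
import numpy as np

def two_pass_render_row(color, disp):
    """Render one scanline of a sprite with depth using forward then inverse warping.

    color: (W,) grayscale values; disp: (W,) horizontal displacement of each source pixel.
    """
    W = disp.shape[0]
    # Pass 1: forward-warp the displacement map itself; the nearer surface wins.
    warped = np.full(W, np.nan)
    for x in range(W):
        xd = int(round(x + disp[x]))
        if 0 <= xd < W and (np.isnan(warped[xd]) or disp[x] > warped[xd]):
            warped[xd] = disp[x]
    # Fill cracks with the smaller (background) displacement of the nearest valid neighbors.
    filled = warped.copy()
    valid = np.flatnonzero(~np.isnan(warped))
    if valid.size:
        for xd in np.flatnonzero(np.isnan(warped)):
            left, right = valid[valid < xd], valid[valid > xd]
            cands = [warped[left[-1]]] if left.size else []
            cands += [warped[right[0]]] if right.size else []
            filled[xd] = min(cands)
    # Pass 2: inverse-warp the colors with linear interpolation (no cracks or aliasing).
    source_x = np.arange(W) - filled
    return np.interp(source_x, np.arange(W), color)
```

Because the colors are resampled by inverse warping, every destination pixel receives an
interpolated value, whereas warping the colors forward directly would leave gaps wherever the
displacement field stretches the image.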

Sprites with depth can also be rendered using conventional graphics hardware, as de- 
scribed in (Zitnick, Kang et al. 2004). Rogmans, Lu et al. (2009) describe GPU imple- 
mentations of both real-time stereo matching and real-time forward and inverse rendering 
algorithms. 
Figure 14.7 Finely sliced fronto-parallel layers: (a) stack of acetates (Szeliski and Golland
1999) © 1999 Springer and (b) multiplane images (Zhou, Tucker et al. 2018) © 2018 ACM.
These representations (which are equivalent) consist of a set of fronto-parallel planes at fixed
depths from a reference camera coordinate frame, with each plane encoding an RGB image
and an alpha map that capture the scene appearance at the corresponding depth.


An alternative to constructing a small number of layers is to discretize the viewing frus- 
tum subtending a layered depth image into a large number of fronto-parallel planes, each of 
which contains RGBA values (Szeliski and Golland 1999), as shown in Figure 14.7. This 
is the same spatial representation we presented in Section 12.1.2 and Figure 12.6 on plane 
sweep approaches to stereo, except that here it is being used to represent a colored 3D scene 
instead of accumulating a matching cost volume. This representation is essentially a
perspective variant of a volumetric representation containing RGB color and α opacity values
(Sections 13.2.1 and 13.5).

This representation was recently rediscovered and now goes under the popular name of 
multiplane images (MPI) (Zhou, Tucker et al. 2018). Figure 14.8 shows an MPI representa- 
tion derived from a stereo image pair along with a novel synthesized view. MPIs are easier to 
derive from pairs or collections of stereo images than true (minimal) layered representations 
because there is a 1:1 correspondence between pixels (actually, voxels) in a plane sweep cost 
volume (Figure 12.5) and an MPI. However, they are not as compact and can lead to tearing 
artifacts once the viewpoint exceeds a certain range. (We will talk about using inpainting to 
mitigate such holes in image-based representations in Section 14.2.2). MPIs are also related 
to the soft 3D volumetric representation proposed earlier by Penner and Zhang (2017). 
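
Rendering from a stack of RGBA planes reduces to repeated over compositing from back to
front. The sketch below shows this for the reference view only; for a novel view, each plane
would first be warped by its own plane-induced homography, which is omitted here. The function
name `composite_mpi` and the array layout (farthest plane first) are assumptions of this
example.

```python
import numpy as np

def composite_mpi(rgb, alpha):
    """Over-composite MPI planes from back to front.

    rgb: (D, H, W, 3) plane colors; alpha: (D, H, W) plane opacities in [0, 1];
    plane 0 is assumed to be the farthest from the camera.
    """
    out = np.zeros(rgb.shape[1:])
    for plane_rgb, plane_alpha in zip(rgb, alpha):
        a = plane_alpha[..., None]
        out = a * plane_rgb + (1.0 - a) * out     # the over operator
    return out
```

For example, an opaque red far plane seen through a green near plane with opacity 0.25
composites to the color (0.75, 0.25, 0).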

Since their initial development for novel view extrapolation, i.e., “stereo magnification”
(Zhou, Tucker et al. 2018), MPIs have found a wide range of applications in image-based
rendering, including extension to multiple input images and faster inference (Flynn, Broxton
et al. 2019), CNN refinement and better inpainting (Srinivasan, Tucker et al. 2019),
interpolating between collections of MPIs (Mildenhall, Srinivasan et al. 2019), and large view
extrapolations (Choi, Gallo et al. 2019). The planar MPI structure has also been generalized
to curved surfaces for representing partial or complete 3D panoramas (Broxton, Flynn et al.
2020; Attal, Ling et al. 2020; Lin, Xu et al. 2020).⁴

Figure 14.8 MPI representation constructed from a stereo pair of color images, along
with a novel view reconstructed from the MPI (Zhou, Tucker et al. 2018) © 2018 ACM. Note
how the planes slice the 3D scene into thin layers, each of which has colors and full or partial
opacities in only a small region. (Panel labels: input images; inferred MPI representation;
a novel view synthesized from the MPI.)

Another important application of layers is in the modeling of reflections. When the reflec- 
tor (e.g., a glass pane) is planar, the reflection forms a virtual image, which can be modeled 
as a separate layer (Section 9.4.2 and Figures 9.16-9.17), so long as additive (instead of over) 
compositing is used to combine the reflected and transmitted images (Szeliski, Avidan, and 
Anandan 2000; Sinha, Kopf et al. 2012; Kopf, Langguth et al. 2013). Figure 14.9 shows an 
example of a two-layer decomposition reconstructed from a short video clip, which can be 
re-rendered from novel views by adding warped versions of the two layers (each of which 
has its own depth map). When the reflective surface is curved, a quasi-stable virtual image 
may still be available, although this depends on the local variations in principal curvatures 
(Swaminathan, Kang et al. 2002; Criminisi, Kang et al. 2005). The modeling of reflections
is one of the advantages attributed to layered representations such as MPIs (Zhou, Tucker et
al. 2018; Broxton, Flynn et al. 2020), although in these papers over compositing is still used,
which results in plausible but not physically correct renderings. 
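
The difference between the two compositing rules is easy to state in code. In the sketch
below, which is only illustrative, the function names are hypothetical, each layer moves by
its own integer pixel shift (standing in for warping with its own depth map), and `np.roll`
wraps around at the border, which a real renderer would not do.

```python
import numpy as np

def over(front, alpha, back):
    """Over compositing: the front layer hides part of the back layer."""
    return alpha * front + (1.0 - alpha) * back

def render_reflection_row(transmitted, reflected, shift_t, shift_r):
    """Additive compositing of a transmitted and a reflected layer on one scanline.

    Each layer is shifted independently because the virtual (reflected) image
    generally lies at a different depth than the transmitted scene.
    """
    return np.roll(transmitted, shift_t) + np.roll(reflected, shift_r)
```

With additive compositing the reflected light is simply summed with the transmitted light, so
neither layer attenuates the other, whereas the over operator always darkens the back layer
by a factor of one minus alpha.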


14.2.2 Application: 3D photography 


⁴ Exploring the interactive 3D videos on the authors’ websites, e.g.,
https://augmentedperception.github.io/deepviewvideo, is a good way to get a sense of this
new medium.

Figure 14.9 Image-based rendering of scenes with reflections using multiple additive
layers (Sinha, Kopf et al. 2012) © 2012 ACM. The left column shows an image from the input
sequence and the next two columns show the two separated layers (transmitted and reflected
light). The last column is an estimate of which portions of the scene are reflective. As you can
see, stray bits of reflections sometimes cling to the transmitted light layer. Note how in the
table, the amount of reflected light (gloss) decreases towards the bottom of the image because
of Fresnel reflection.

The desire to capture and view photographs of the world in 3D prompted the development
of stereo cameras and viewers in the mid-1800s (Luo, Kong et al. 2020) and more recently
the popularity of 3D movies.⁵ It has also underpinned much of the research in 3D shape and
appearance capture and modeling we studied in the previous chapter and more specifically
Section 13.7.2. Until recently, however, while the required multiple images could be captured
with hand-held cameras (Pollefeys, Van Gool et al. 2004; Snavely, Seitz, and Szeliski 2006),
desktop or laptop computers were required to process and interactively view the images.

⁵ It is interesting to note, however, that for now (at least), in-home 3D TV sets have failed to take off.

Figure 14.10 Systems for capturing and modeling 3D scenes from handheld photographs.
(a) Casual 3D Photography takes a series of overlapping images and constructs per-image
depth maps, which are then warped and blended together into a two-layer representation
(Hedman, Alsisan et al. 2017) © 2017 ACM. (b) Instant 3D Photography starts with the depth
maps produced by a dual-lens smartphone and warps and registers the depth maps to create a
similar representation with far less computation (Hedman and Kopf 2018) © 2018 ACM. (c)
One Shot 3D Photography starts with a single photo, performs monocular depth estimation,
layer construction and inpainting, and mesh and atlas generation, enabling phone-based
reconstruction and interactive viewing (Kopf, Matzen et al. 2020) © 2020 ACM.

The ability to capture, construct, and widely share such 3D models has dramatically
increased in the last few years and now goes under the name of 3D photography. Hedman,
Alsisan et al. (2017) describe their Casual 3D Photography system, which takes a sequence
of overlapping images taken from a moving camera and then uses a combination of structure
from motion, multi-view stereo, and 3D image warping and stitching to construct two-layer
partial panoramas that can be viewed on a computer, as shown in Figure 14.10. The Instant
3D system of Hedman and Kopf (2018) builds a similar system, but starts with the depth
images available from newer dual-camera smartphones to significantly speed up the process.
Note, however, that the individual depth images are not metric, i.e., related to true depth with
a single global scalar transformation, so they must be deformably warped before being stitched
together. A texture atlas is then constructed to compactly store the pixel color values while
also supporting multiple layers.

While these systems produce beautiful wide 3D images that can create a true sense of
immersion (“being there”), much more practical and faster solutions can be constructed using
a single depth image. Kopf, Alsisan et al. (2019) describe their phone-based system, which
takes a single dual-lens photograph with its estimated depth map and constructs a multi-layer
3D photograph with occluded pixels being inpainted from nearby background pixels (see
Section 10.5.1 and Shih, Su et al. 2020).⁶ To remove the requirement for depth maps being
associated with the input images, Kopf, Matzen et al. (2020) use a monocular depth inference
network (Section 12.8) to estimate the depth, thereby enabling 3D photos to be produced
from any photograph in a phone’s camera roll, or even from historical photographs, as shown
in Figure 14.10c.⁷ When historic stereographs are available, these can be used to create even
more accurate 3D photographs, as shown by Luo, Kong et al. (2020). It is also possible to
create a “3D Ken Burns” effect, i.e., small looming video clips, from regular images using
monocular depth inference (Niklaus, Mai et al. 2019).⁸


14.3 Light fields and Lumigraphs 


While image-based rendering approaches can synthesize scene renderings from novel
viewpoints, they raise the following more general question:

Is it possible to capture and render the appearance of a scene from all possible
viewpoints and, if so, what is the complexity of the resulting structure?


⁶ Facebook rolled out 3D photographs for the iPhone in October 2018,
https://facebook360.fb.com/2018/10/11/3d-photos-now-rolling-out-on-facebook-and-in-vr,
along with the ability to post and interactively view the photos.

⁷ In February 2020, Facebook released the ability to use regular photos,
https://ai.facebook.com/blog/powered-by-ai-turning-any-2d-photo-into-3d-using-convolutional-neural-nets.

⁸ Google released a similar feature called Cinematic photos,
https://blog.google/products/photos/new-cinematic-photos-and-more-ways-relive-your-memories.

876 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021)

Figure 14.11 The Lumigraph (Gortler, Grzeszczuk et al. 1996) © 1996 ACM: (a) a ray is
represented by its 4D two-plane parameters (s, t) and (u, v); (b) a slice through the 3D light
field subset (u, v, s).

Let us assume that we are looking at a static scene, i.e., one where the objects and
illuminants are fixed, and only the observer is moving around. Under these conditions, we can
describe each image by the location and orientation of the virtual camera (6 dof) as well as
its intrinsics (e.g., its focal length). However, if we capture a two-dimensional spherical
image around each possible camera location, we can re-render any view from this information.⁹
Thus, taking the cross-product of the three-dimensional space of camera positions with the
2D space of spherical images, we obtain the 5D plenoptic function of Adelson and Bergen
(1991), which forms the basis of the image-based rendering system of McMillan and Bishop
(1995).

Notice, however, that when there is no light dispersion in the scene, i.e., no smoke or fog, 
all the coincident rays along a portion of free space (between solid or refractive objects) have 
the same color value. Under these conditions, we can reduce the 5D plenoptic function to 
the 4D light field of all possible rays (Gortler, Grzeszczuk et al. 1996; Levoy and Hanrahan 
1996; Levoy 2006).¹⁰

⁹ As we are counting dimensions, we ignore for now any sampling or resolution issues.

¹⁰ Levoy and Hanrahan (1996) borrowed the term light field from a paper by Gershun (1939).
Another name for this representation is the photic field (Moon and Spencer 1981).

To make the parameterization of this 4D function simpler, let us put two planes in the
3D scene roughly bounding the area of interest, as shown in Figure 14.11a. Any light ray
terminating at a camera that lives in front of the st plane (assuming that this space is empty)
passes through the two planes at (s, t) and (u, v) and can be described by its 4D coordinate
(s, t, u, v). This diagram (and parameterization) can be interpreted as describing a family of
cameras living on the st plane with their image planes being the uv plane. The uv plane
can be placed at infinity, which corresponds to all the virtual cameras looking in the same
direction.
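
To make the two-plane parameterization concrete, the sketch below intersects a ray with two
parallel planes. The placement of the st plane at Z = 0 and the uv plane at Z = 1, and the
function name `ray_to_light_slab`, are assumptions made purely for illustration; any pair of
parallel planes works the same way.

```python
import numpy as np

def ray_to_light_slab(origin, direction, z_st=0.0, z_uv=1.0):
    """Return the (s, t, u, v) coordinates of a ray in a two-plane light slab.

    origin, direction: 3D ray origin and direction; the st and uv planes are
    assumed to be the planes Z = z_st and Z = z_uv, respectively.
    """
    origin = np.asarray(origin, float)
    direction = np.asarray(direction, float)
    if abs(direction[2]) < 1e-12:
        raise ValueError("ray is parallel to the parameterization planes")
    # Ray parameter at which the ray crosses each plane.
    lam_st = (z_st - origin[2]) / direction[2]
    lam_uv = (z_uv - origin[2]) / direction[2]
    s, t = (origin + lam_st * direction)[:2]
    u, v = (origin + lam_uv * direction)[:2]
    return s, t, u, v
```

For instance, a ray starting at (0, 0, -1) with direction (1, 0, 1) crosses the st plane at
(1, 0) and the uv plane at (2, 0), so its light field coordinate is (1, 0, 2, 0).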

In practice, if the planes are of finite extent, the finite light slab L(s, t, u, v) can be used to 
generate any synthetic view that a camera would see through a (finite) viewport in the st plane 
with a view frustum that wholly intersects the far uv plane. To enable the camera to move 
all the way around an object, the 3D space surrounding the object can be split into multiple 
domains, each with its own light slab parameterization. Conversely, if the camera is moving 
inside a bounded volume of free space looking outward, multiple cube faces surrounding the 
camera can be used as (s, t) planes. 

Thinking about 4D spaces is difficult, so let us drop our visualization by one dimension. 
If we fix the row value t and constrain our camera to move along the s axis while looking
at the uv plane, we can stack all of the stabilized images the camera sees to get the (u, v, s) 
epipolar volume, which we discussed in Section 12.7. A “horizontal” cross-section through 
this volume is the well-known epipolar plane image (Bolles, Baker, and Marimont 1987), 
which is the us slice shown in Figure 14.11b. 

As you can see in this slice, each color pixel moves along a linear track whose slope
is related to its depth (parallax) from the uv plane. (Pixels exactly on the uv plane appear
“vertical”, i.e., they do not move as the camera moves along s.) Furthermore, pixel tracks
occlude one another as their corresponding 3D surface elements occlude. Translucent pixels,
however, composite over background pixels (Section 3.1.3 (3.8)) rather than occluding them.
Thus, we can think of adjacent pixels sharing a similar planar geometry as EPI strips or
EPI tubes (Criminisi, Kang et al. 2005). 3D lightfields taken from a camera slowly moving
through a static scene can be an excellent source for high-accuracy 3D reconstruction, as
demonstrated in the papers by Kim, Zimmer et al. (2013), Yücer, Kim et al. (2016), and
Yücer, Sorkine-Hornung et al. (2016).
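
The linear tracks can be reproduced with a few lines of code. The sketch below assumes a
simplified 2D geometry that is not spelled out in the text: the st and uv planes are one unit
apart, and z measures a scene point's distance from the uv plane toward the cameras (so
z = 0 lies on the uv plane). The function names `epi_slice` and `epi_track` and the
(S, V, U) array layout are hypothetical.

```python
import numpy as np

def epi_slice(volume, v):
    """Extract the (s, u) epipolar plane image at row v from an (S, V, U) image stack."""
    return volume[:, v, :]

def epi_track(x, z, s_values):
    """u coordinate at which a point at lateral position x and depth z is seen from each s.

    Assumes unit plane separation and z < 1 (the point is in front of the cameras).
    The ray from camera s through the point is extended until it reaches the uv plane.
    """
    s = np.asarray(s_values, float)
    return s + (x - s) / (1.0 - z)
```

Under these assumptions the track has constant slope du/ds = -z / (1 - z): a point on the uv
plane (z = 0) stays at u = x for every camera, while a point halfway between the planes
(z = 0.5) moves one unit in u for every unit of camera motion, in the opposite direction.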

The equations mapping between pixels (x, y) in a virtual camera and the corresponding
(s, t, u, v) coordinates are relatively straightforward to derive and are sketched out in
Exercise 14.7. It is also possible to show that the set of pixels corresponding to a regular
orthographic or perspective camera, i.e., one that has a linear projective relationship between
3D points and (x, y) pixels (2.63), lies along a two-dimensional hyperplane in the (s, t, u, v)
light field (Exercise 14.7).

While a light field can be used to render a complex 3D scene from novel viewpoints, 
a much better rendering (with less ghosting) can be obtained if something is known about 
its 3D geometry. The Lumigraph system of Gortler, Grzeszczuk et al. (1996) extends the 
basic light field rendering approach by taking into account the 3D location of surface points 
corresponding to each 3D ray. 


Figure 14.12 Depth compensation in the Lumigraph (Gortler, Grzeszczuk et al. 1996) © 
1996 ACM. To resample the (s, u) dashed light ray, the u parameter corresponding to each 
discrete $s_i$ camera location is modified according to the out-of-plane depth z to yield new 
coordinates $u'$ and $u''$; in (u, s) ray space, the original sample (△) is resampled from the 
$(s_i, u')$ and $(s_{i+1}, u'')$ samples, which are themselves linear blends of their adjacent (◦) 
samples. 

Consider the ray (s, u) corresponding to the dashed line in Figure 14.12, which intersects 
the object’s surface at a distance z from the uv plane. When we look up the pixel’s color in 
camera $s_i$ (assuming that the light field is discretely sampled on a regular 4D (s, t, u, v) grid), 
the actual pixel coordinate is $u'$, instead of the original u value specified by the (s, u) ray. 
Similarly, for camera $s_{i+1}$ (where $s_i < s < s_{i+1}$), pixel address $u''$ is used. Thus, instead of 
using quadri-linear interpolation of the nearest sampled (s, t, u, v) values around a given ray 
to determine its color, the (u, v) values are modified for each discrete $(s_i, t_i)$ camera. 

Figure 14.12 also shows the same reasoning in ray space. Here, the original continuous- 
valued (s,u) ray is represented by a triangle and the nearby sampled discrete values are 
shown as circles. Instead of just blending the four nearest samples, as would be indicated 
by the vertical and horizontal dashed lines, the modified $(s_i, u')$ and $(s_{i+1}, u'')$ values are 
sampled instead and their values are then blended. 
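
The depth correction can be sketched for a single (s, u) slice as follows. This is a minimal illustration under stated assumptions rather than the Lumigraph implementation: the camera plane sits at depth 0 and the uv plane at depth 1, s and u are measured in the same units with unit sample spacing, `depth` is the distance of the surface point from the camera plane (so the out-of-plane offset z corresponds to `depth - 1`), and all function names are hypothetical.

```python
import math

def depth_corrected_u(s, u, s_i, depth):
    """u coordinate at which camera s_i sees the surface point hit by ray (s, u).

    The surface point is X = s + (u - s) * depth; the ray from camera s_i
    through X crosses the uv plane (depth 1) at s_i + (X - s_i) / depth.
    """
    return u + (s - s_i) * (1.0 / depth - 1.0)

def sample_u(row, u):
    """Linearly interpolate one camera's row of samples at continuous u."""
    u = min(max(u, 0.0), len(row) - 1.0)
    j = min(int(math.floor(u)), len(row) - 2)
    f = u - j
    return (1.0 - f) * row[j] + f * row[j + 1]

def render_ray(light_field, s, u, depth):
    """Blend the two cameras bracketing s, each sampled at its corrected u."""
    i = max(0, min(int(math.floor(s)), len(light_field) - 2))
    w = s - i                                    # weight of camera i + 1
    u1 = depth_corrected_u(s, u, i, depth)       # u'  for camera s_i
    u2 = depth_corrected_u(s, u, i + 1, depth)   # u'' for camera s_{i+1}
    return ((1.0 - w) * sample_u(light_field[i], u1)
            + w * sample_u(light_field[i + 1], u2))
```

When the surface lies exactly on the uv plane (`depth = 1`), the correction vanishes and the lookup reduces to ordinary bilinear interpolation in (s, u).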

The resulting rendering system produces images of much better quality than a proxy-free 
light field and is the method of choice whenever 3D geometry can be inferred. In subsequent 
work, Isaksen, McMillan, and Gortler (2000) show how a planar proxy for the scene, which 
is a simpler 3D model, can be used to simplify the resampling equations. They also describe 
how to create synthetic aperture photos, which mimic what might be seen by a wide-aperture 
lens, by blending more nearby samples (Levoy and Hanrahan 1996). A similar approach can 
be used to re-focus images taken with a plenoptic (microlens array) camera (Ng, Levoy et 
al. 2005; Ng 2005) or a light field microscope (Levoy, Ng et al. 2006). It can also be used 
to see through obstacles, using extremely large synthetic apertures focused on a background 
that can blur out foreground objects and make them appear translucent (Wilburn, Joshi et al. 
2005; Vaish, Szeliski et al. 2006). 

Now that we understand how to render new images from a light field, how do we go about 
capturing such datasets? One answer is to move a calibrated camera with a motion control rig 
or gantry.¹¹ Another approach is to take handheld photographs and to determine the pose and 
intrinsic calibration of each image using either a calibrated stage or structure from motion. In 
this case, the images need to be rebinned into a regular 4D (s, t, u, v) space before they can 
be used for rendering (Gortler, Grzeszczuk et al. 1996). Alternatively, the original images 
can be used directly using a process called the unstructured Lumigraph, which we describe 
below. 

Because of the large number of images involved, light fields and Lumigraphs can be quite 
voluminous to store and transmit. Fortunately, as you can tell from Figure 14.11b, there is 
a tremendous amount of redundancy (coherence) in a light field, which can be made even 
more explicit by first computing a 3D model, as in the Lumigraph. A number of techniques 
have been developed to compress and progressively transmit such representations (Gortler, 
Grzeszczuk et al. 1996; Levoy and Hanrahan 1996; Rademacher and Bishop 1998; Mag- 
nor and Girod 2000; Wood, Azuma et al. 2000; Shum, Kang, and Chan 2003; Magnor, Ra- 
manathan, and Girod 2003; Zhang and Chen 2004; Shum, Chan, and Kang 2007). 

Since the original burst of research on lightfields in the mid-1990s and early 2000s, better 
techniques continue to be developed for analyzing and rendering such images. Some repre- 
sentative papers and datasets from the last decade include Wanner and Goldluecke (2014), 
Honauer, Johannsen et al. (2016), Kalantari, Wang, and Ramamoorthi (2016), Wu, Masia et 
al. (2017), and Shin, Jeon et al. (2018). 


14.3.1 Unstructured Lumigraph 


When the images in a Lumigraph are acquired in an unstructured (irregular) manner, it can be 
counterproductive to resample the resulting light rays into a regularly binned (s, t, u, v) data 
structure. This is both because resampling always introduces a certain amount of aliasing and 
because the resulting gridded light field can be populated very sparsely or irregularly. 

¹¹See http://lightfield.stanford.edu/acq.html for a description of some of the gantries and camera arrays built at the 
Stanford Computer Graphics Laboratory (Wilburn, Joshi et al. 2005). A more recent dataset was created by Honauer, 
Johannsen et al. (2016) and is available at https://lightfield-analysis.uni-konstanz.de. Both websites provide light field 
datasets that are a great source of research and project material. 

The alternative is to render directly from the acquired images, by finding for each light 
ray in a virtual camera the closest pixels in the original images. The unstructured Lumigraph 
rendering (ULR) system of Buehler, Bosse et al. (2001) describes how to select such pixels 
by combining a number of fidelity criteria, including epipole consistency (distance of rays 
to a source camera’s center), angular deviation (similar incidence direction on the surface), 
resolution (similar sampling density along the surface), continuity (to nearby pixels), and con- 
sistency (along the ray). These criteria can all be combined to determine a weighting function 
between each virtual camera’s pixel and a number of candidate input cameras from which it 
can draw colors. To make the algorithm more efficient, the computations are performed by 
discretizing the virtual camera’s image plane using a regular grid overlaid with the polyhedral 
object mesh model and the input camera centers of projection and interpolating the weighting 
functions between vertices. 
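
To make the idea of a per-ray weighting function concrete, here is a deliberately simplified Python sketch that uses only one of the criteria listed above, angular deviation. The thresholding scheme (the weight falls to zero at the penalty of the first camera that is not selected), the parameter `k`, and the function names are assumptions for illustration; the full system combines several penalties and interpolates the weights over the image plane.

```python
import math

def angle_between(a, b):
    """Angle in radians between two 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def ulr_weights(surface_point, virtual_center, camera_centers, k=3):
    """Normalized weights for the k cameras with the smallest angular deviation."""
    to_virtual = [c - p for c, p in zip(virtual_center, surface_point)]
    penalties = []
    for center in camera_centers:
        to_camera = [c - p for c, p in zip(center, surface_point)]
        penalties.append(angle_between(to_virtual, to_camera))
    order = sorted(range(len(penalties)), key=lambda i: penalties[i])
    # Cutoff: penalty of the first camera that is not selected (if any).
    if len(order) > k:
        thresh = penalties[order[k]]
    else:
        thresh = penalties[order[-1]] + 1e-6
    thresh = max(thresh, 1e-9)
    raw = {i: max(0.0, 1.0 - penalties[i] / thresh) for i in order[:k]}
    total = sum(raw.values())
    if total == 0.0:                      # all candidates tie at the cutoff
        return {i: 1.0 / len(raw) for i in raw}
    return {i: w / total for i, w in raw.items()}
```

Letting the weight reach zero at the cutoff means that a camera entering or leaving the selected set does so with zero contribution, which avoids visible popping as the virtual camera moves.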

The unstructured Lumigraph generalizes previous work in both image-based rendering 
and light field rendering. When the input cameras are gridded, the ULR behaves the same way 
as regular Lumigraph rendering. When fewer cameras are available but the geometry is accu- 
rate, the algorithm behaves similarly to view-dependent texture mapping (Section 14.1.1). If 
RGB-D depth images are available, these can be fused into lower-resolution proxies that can 
be combined with higher-resolution source images at rendering time (Hedman, Ritschel et al. 
2016). And while the original ULR paper uses manually constructed rules for determining 
pixel weights, it is also possible to learn such blending weights using a deep neural network 
(Hedman, Philip et al. 2018; Riegler and Koltun 2020a). 


14.3.2 Surface light fields 


Of course, using a two-plane parameterization for a light field is not the only possible choice. 
(It is the one usually presented first, as the projection equations and visualizations are the 
easiest to draw and understand.) As we mentioned on the topic of light field compression, 
if we know the 3D shape of the object or scene whose light field is being modeled, we can 
effectively compress the field because nearby rays emanating from nearby surface elements 
have similar color values. 

In fact, if the object is totally diffuse, ignoring occlusions, which can be handled using 
3D graphics algorithms or z-buffering, all rays passing through a given surface point will 
have the same color value. Hence, the light field “collapses” to the usual 2D texture-map 
defined over an object’s surface. Conversely, if the surface is totally specular (e.g., mirrored), 
each surface point reflects a miniature copy of the environment surrounding that point. In the 
absence of inter-reflections (e.g., a convex object in a large open space), each surface point 
simply reflects the far-field environment map (Section 2.2.1), which again is two-dimensional. 
Therefore, it seems that re-parameterizing the 4D light field to lie on the object’s surface can 
be extremely beneficial. 


Figure 14.13 Surface light fields (Wood, Azuma et al. 2000) © 2000 ACM: (a) example of a 
highly specular object with strong inter-reflections; (b) the surface light field stores the light 
emanating from each surface point in all visible directions as a “Lumisphere”. 

These observations underlie the surface light field representation introduced by Wood, 
Azuma et al. (2000). In their system, an accurate 3D model is built of the object being 
represented. Then the Lumisphere of all rays emanating from each surface point is estimated or 
captured (Figure 14.13). Nearby Lumispheres will be highly correlated and hence amenable 
to both compression and manipulation. 

To estimate the diffuse component of each Lumisphere, a median filtering over all visible 
exiting directions is first performed for each channel. Once this has been subtracted from the 
Lumisphere, the remaining values, which should consist mostly of the specular components, 
are reflected around the local surface normal (2.90), which turns each Lumisphere into a copy 
of the local environment around that point. Nearby Lumispheres can then be compressed 
using predictive coding, vector quantization, or principal component analysis. 
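
The two steps just described (a per-channel median for the diffuse term, then reflection of the residual directions about the normal) can be sketched as follows. The data layout, a list of `(direction, rgb)` samples per surface point with unit-length vectors, and the function names are assumptions made for this example.

```python
from statistics import median

def split_lumisphere(samples):
    """Split one Lumisphere into a diffuse color and a specular residual.

    samples: list of (direction, rgb) pairs for one surface point.
    The diffuse color is the per-channel median over all visible directions.
    """
    diffuse = tuple(median(rgb[c] for _, rgb in samples) for c in range(3))
    residual = [(d, tuple(rgb[c] - diffuse[c] for c in range(3)))
                for d, rgb in samples]
    return diffuse, residual

def reflect(direction, normal):
    """Mirror a unit direction about the unit surface normal: 2 (n . d) n - d."""
    dot = sum(d * n for d, n in zip(direction, normal))
    return tuple(2.0 * dot * n - d for d, n in zip(direction, normal))
```

The median is a robust choice here because specular highlights occupy only a small fraction of the exiting directions and therefore act as outliers.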

The decomposition into a diffuse and specular component can also be used to perform 
editing or manipulation operations, such as re-painting the surface, changing the specular 
component of the reflection (e.g., by blurring or sharpening the specular Lumispheres), or 
even geometrically deforming the object while preserving detailed surface appearance. 

In more recent work, Park, Newcombe, and Seitz (2018) use an RGB-D camera to acquire 
a 3D model and its diffuse reflectance layer using min compositing and iteratively reweighted 
least squares, as discussed in Section 9.4.2. They then estimate a simple piecewise-constant 
BRDF model to account for the specular components. In their follow-on Seeing the World in 
a Bag of Chips paper, Park, Holynski, and Seitz (2020) also estimate the specular reflectance 
map, which is a convolution of the environment map with the object’s specular BRDF. Additional 
techniques to estimate spatially varying BRDFs are discussed in Section 13.7.1. 


In summary, surface light fields are a good representation to add realism to scanned 3D 
object models by modeling their specular properties, thus avoiding the “cardboard” (matte) 
appearance of such models when their reflections are ignored. For larger scenes, especially 
those containing large planar reflectors such as glass windows or glossy tables, modeling 
the reflections as separate layers, as discussed in Sections 9.4.2 and 14.2.1, or as true mirror 
surfaces (Whelan, Goesele et al. 2018), may be more appropriate. 


14.3.3 Application: Concentric mosaics 


A useful and simple version of light field rendering is a panoramic image with parallax, i.e., a 
video or series of photographs taken from a camera swinging in front of some rotation point. 
Such panoramas can be captured by placing a camera on a boom on a tripod, or even more 
simply, by holding a camera at arm’s length while rotating your body around a fixed axis. 

The resulting set of images can be thought of as a concentric mosaic (Shum and He 
1999; Shum, Wang et al. 2002) or a layered depth panorama (Zheng, Kang et al. 2007). 
The term “concentric mosaic” comes from a particular structure that can be used to re-bin all 
of the sampled rays, essentially associating each column of pixels with the “radius” of the 
concentric circle to which it is tangent (Ishiguro, Yamamoto, and Tsuji 1992; Shum and He 
1999; Peleg, Ben-Ezra, and Pritch 2001). 
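
The association between an image column and a concentric circle follows from simple plane geometry, sketched below. The setup is an assumption for illustration: the camera sits at distance `boom_radius` from the rotation axis and looks radially outward, `cx` and `focal` are given in pixels, and the function name is hypothetical.

```python
import math

def column_radius(x, cx, focal, boom_radius):
    """Radius of the concentric circle tangent to the rays of image column x.

    A ray leaving the camera at angle phi from the radial direction passes
    the rotation axis at a perpendicular distance of boom_radius * |sin(phi)|.
    """
    phi = math.atan2(x - cx, focal)
    return boom_radius * abs(math.sin(phi))
```

The central column (x = cx) maps to radius zero, i.e., to rays passing through the rotation axis, while columns farther from the center map to progressively larger circles, up to the radius of the capture ring itself.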

Rendering from such data structures is fast and straightforward. If we assume that the 
scene is far enough away, for any virtual camera location, we can associate each column of 
pixels in the virtual camera with the nearest column of pixels in the input image set. (For 
a regularly captured set of images, this computation can be performed analytically.) If we 
have some rough knowledge of the depth of such pixels, columns can be stretched vertically 
to compensate for the change in depth between the two cameras. If we have an even more 
detailed depth map (Peleg, Ben-Ezra, and Pritch 2001; Li, Shum et al. 2004; Zheng, Kang et 
al. 2007), we can perform pixel-by-pixel depth corrections. 

While the virtual camera’s motion is constrained to lie in the plane of the original cameras 
and within the radius of the original capture ring, the resulting experience can exhibit complex 
rendering phenomena, such as reflections and translucencies, which cannot be captured using 
a texture-mapped 3D model of the world. Exercise 14.10 has you construct a concentric 
mosaic rendering system from a series of hand-held photos or video. 

While concentric mosaics are captured by moving the camera on a (roughly) circular 
arc, it is also possible to construct manifold projections (Peleg and Herman 1997), multiple- 
center-of-projection images (Rademacher and Bishop 1998), and multi-perspective panora- 
mas (Román, Garg, and Levoy 2004; Román and Lensch 2006; Agarwala, Agrawala et al. 
2006; Kopf, Chen et al. 2010), which we discussed briefly in Section 8.2.5. 


14.3.4 Application: Synthetic re-focusing 


In addition to the interactive viewing of captured scenes and objects, light field rendering can 
be used to add synthetic depth of field effects to photographs (Levoy 2006). In the compu- 
tational photography chapter (Section 10.3.2), we mentioned how the depth estimates pro- 
duced by modern dual-lens and/or dual-pixel smartphones can be used to synthetically blur 
photographs (Wadhwa, Garg et al. 2018; Garg, Wadhwa et al. 2019; Zhang, Wadhwa et al. 
2020). 

When larger numbers of input images are available, e.g., when using microlens arrays, 
the images can be shifted and combined to simulate the effects of a larger aperture lens in 
what is known as synthetic aperture photography (Ng, Levoy et al. 2005; Ng 2005), which 
was the basis of the Lytro light field camera. Related ideas have been used for shallow depth 
of field in light field microscopy (Levoy, Chen et al. 2004; Levoy, Ng et al. 2006), obstruction 
removal (Wilburn, Joshi et al. 2005; Vaish, Szeliski et al. 2006; Xue, Rubinstein et al. 2015; 
Liu, Lai et al. 2020a), and coded aperture photography (Levin, Fergus et al. 2007; Zhou, Lin, 
and Nayar 2009). 
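
The shift-and-combine operation behind synthetic aperture photography can be sketched in a few lines. This is a toy version under stated assumptions: sub-aperture views are grayscale images stored as lists of rows and indexed by their aperture offset (s, t), shifts are rounded to whole pixels, and image borders wrap around. The function name and data layout are hypothetical.

```python
def refocus(views, disparity):
    """Average sub-aperture views after shifting each by disparity * (s, t).

    views: dict mapping an aperture offset (s, t) to an image (list of rows).
    disparity: pixel shift per unit of aperture offset; it selects the depth
    that ends up in focus.
    """
    first = next(iter(views.values()))
    height, width = len(first), len(first[0])
    out = [[0.0] * width for _ in range(height)]
    for (s, t), img in views.items():
        dx, dy = round(disparity * s), round(disparity * t)
        for y in range(height):
            for x in range(width):
                out[y][x] += img[(y + dy) % height][(x + dx) % width]
    n = len(views)
    return [[value / n for value in row] for row in out]
```

Scene points whose parallax matches the chosen `disparity` line up across views and stay sharp, while points at other depths are averaged over different positions and blur out, which is also the mechanism behind seeing through foreground obstructions.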


14.4 Environment mattes 


So far in this chapter, we have dealt with view interpolation and light fields, which are tech- 
niques for modeling and rendering complex static scenes seen from different viewpoints. 

What if, instead of moving around a virtual camera, we take a complex, refractive object, 
such as the water goblet shown in Figure 14.14, and place it in front of a new background? 
Instead of modeling the 4D space of rays emanating from a scene, we now need to model 
how each pixel in our view of this object refracts incident light coming from its environment. 

What is the intrinsic dimensionality of such a representation and how do we go about 
capturing it? Let us assume that if we trace a light ray from the camera at pixel (x, y) toward 
the object, it is reflected or refracted back out toward its environment at an angle $(\phi, \theta)$. If 
we assume that other objects and illuminants are sufficiently distant (the same assumption we 
made for surface light fields in Section 14.3.2), this 4D mapping $(x, y) \to (\phi, \theta)$ captures 
all the information between a refractive object and its environment. Zongker, Werner et al. 
(1999) call such a representation an environment matte, as it generalizes the process of object 
matting (Section 10.4) to not only cut and paste an object from one image into another but 
also take into account the subtle refractive or reflective interplay between the object and its 
environment. 


Figure 14.14 Environment mattes: (a–b) a refractive object can be placed in front of a 
series of backgrounds and their light patterns will be correctly refracted (Zongker, Werner 
et al. 1999) (c) multiple refractions can be handled using a Gaussian mixture model and (d) 
real-time mattes can be pulled using a single graded colored background (Chuang, Zongker 
et al. 2000) © 2000 ACM. 

Recall from Equations (3.8) and (10.29) that a foreground object can be represented by 
its premultiplied colors and opacities $(\alpha F, \alpha)$. Such a matte can then be composited onto a 
new background $B$ using 

$$C_i = \alpha_i F_i + (1 - \alpha_i) B_i, \tag{14.1}$$

where $i$ is the pixel under consideration. In environment matting, we augment this equation 
with a reflective or refractive term to model indirect light paths between the environment and 
the camera. In the original work of Zongker, Werner et al. (1999), this indirect component $I_i$ 
is modeled as 

$$I_i = R_i \int A_i(x) B(x) \, dx, \tag{14.2}$$

where $A_i$ is the rectangular area of support for that pixel, $R_i$ is the colored reflectance or 
transmittance (for colored glossy surfaces or glass), and $B(x)$ is the background (environment) 
image, which is integrated over the area $A_i(x)$. In follow-on work, Chuang, Zongker 
et al. (2000) use a superposition of oriented Gaussians, 

$$I_i = \sum_j R_{ij} \int G_{ij}(x) B(x) \, dx, \tag{14.3}$$

where each 2D Gaussian 

$$G_{ij}(x) = G_{2D}(x; c_{ij}, \sigma_{ij}, \theta_{ij}) \tag{14.4}$$

is modeled by its center $c_{ij}$, unrotated widths $\sigma_{ij} = (\sigma^x_{ij}, \sigma^y_{ij})$, and orientation $\theta_{ij}$. 

Given a representation for an environment matte, how can we go about estimating it for a 
particular object? The trick is to place the object in front of a monitor (or surrounded by a set 
of monitors), where we can change the illumination patterns $B(x)$ and observe the value of 
each composite pixel $C_i$.¹² 

As with traditional two-screen matting (Section 10.4.1), we can use a variety of solid 
colored backgrounds to estimate each pixel’s foreground color $\alpha_i F_i$ and partial coverage 
(opacity) $\alpha_i$. To estimate the area of support $A_i$ in (14.2), Zongker, Werner et al. (1999) use 
a series of periodic horizontal and vertical solid stripes at different frequencies and phases, 
which is reminiscent of the structured light patterns used in active rangefinding (Section 13.2). 
For the more sophisticated Gaussian mixture model (14.3), Chuang, Zongker et al. (2000) 
sweep a series of narrow Gaussian stripes at four different orientations (horizontal, vertical, 
and two diagonals), which enables them to estimate multiple oriented Gaussian responses at 
each pixel. 
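
For the first step, estimating the premultiplied foreground color and opacity from solid backgrounds, a minimal per-pixel sketch with two known backgrounds looks as follows. It uses only the compositing equation (14.1) and ignores the indirect term, so it is an illustration of the principle rather than the procedure of either paper; the function name is hypothetical.

```python
def two_background_matte(c1, c2, b1, b2):
    """Estimate opacity and premultiplied foreground for one pixel.

    c1, c2: observed RGB colors over the known backgrounds b1 and b2.
    From C_k = alpha*F + (1 - alpha)*B_k it follows that
    C1 - C2 = (1 - alpha)(B1 - B2), solved here in the least-squares sense.
    """
    dc = [p - q for p, q in zip(c1, c2)]
    db = [p - q for p, q in zip(b1, b2)]
    denom = sum(d * d for d in db)
    if denom == 0.0:
        raise ValueError("the two backgrounds must differ at this pixel")
    transparency = sum(p * q for p, q in zip(dc, db)) / denom
    alpha = min(1.0, max(0.0, 1.0 - transparency))
    premultiplied = tuple(c - (1.0 - alpha) * b for c, b in zip(c1, b1))
    return alpha, premultiplied
```

Using backgrounds that differ strongly in every channel (e.g., black and white) keeps the denominator large and the estimate well conditioned.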

Once an environment matte has been “pulled”, it is then a simple matter to replace the 
background with a new image B(x) to obtain a novel composite of the object placed in a 
different environment (Figure 14.14a–c). The use of multiple backgrounds during the matting 
process, however, precludes the use of this technique with dynamic scenes, e.g., water pouring 
into a glass (Figure 14.14d). In this case, a single graded color background can be used to 
estimate a single 2D monochromatic displacement for each pixel (Chuang, Zongker et al. 
2000). 
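
Re-compositing with a pulled matte can be sketched for a single pixel by adding the indirect term (14.2) to the basic compositing equation (14.1). The sketch assumes that the area of support is a box filter normalized to unit area, so that the integral becomes an average of the new background over a rectangle; the argument layout and function name are hypothetical.

```python
def composite_pixel(alpha, premultiplied, reflectance, support, background, x, y):
    """C = alpha*F + (1 - alpha)*B(x, y) + R * (mean of B over the support).

    premultiplied: the alpha*F color triple for this pixel.
    support: (x0, y0, x1, y1), inclusive pixel bounds in the background image.
    background: image stored as rows of RGB triples.
    """
    x0, y0, x1, y1 = support
    count = (x1 - x0 + 1) * (y1 - y0 + 1)
    mean = [0.0, 0.0, 0.0]
    for yy in range(y0, y1 + 1):
        for xx in range(x0, x1 + 1):
            for c in range(3):
                mean[c] += background[yy][xx][c] / count
    direct = background[y][x]
    return tuple(premultiplied[c] + (1.0 - alpha) * direct[c]
                 + reflectance[c] * mean[c] for c in range(3))
```

For the Gaussian mixture model (14.3), the box average would be replaced by a sum of Gaussian-weighted averages, one per component.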


14.4.1 Higher-dimensional light fields 


As you can tell from the preceding discussion, an environment matte in principle maps every 
pixel (x, y) into a 4D distribution over light rays and is, hence, a six-dimensional representa- 
tion. (In practice, each 2D pixel’s response is parameterized using a dozen or so parameters, 
e.g., $\{F, \alpha, B, R, A\}$, instead of a full mapping.) What if we want to model an object's 
refractive properties from every potential point of view? In this case, we need a mapping from 
every incoming 4D light ray to every potential exiting 4D light ray, which is an 8D represen- 
tation. If we use the same trick as with surface light fields, we can parameterize each surface 
point by its 4D BRDF to reduce this mapping back down to 6D, but this loses the ability to 
handle multiple refractive paths. 

If we want to handle dynamic light fields, we need to add another temporal dimension. 
(Wenger, Gardner et al. (2005) gives a nice example of a dynamic appearance and illumination 
acquisition system.) Similarly, if we want a continuous distribution over wavelengths, 
this becomes another dimension. 

¹²If we relax the assumption that the environment is distant, the monitor can be placed at several depths to estimate 
a depth-dependent mapping function (Zongker, Werner et al. 1999). 

Figure 14.15 The geometry-image continuum in image-based rendering (Kang, Szeliski, 
and Anandan 2000) © 2000 IEEE. Representations at the left of the spectrum use more detailed 
geometry and simpler image representations, while representations and algorithms on 
the right use more images and less geometry. 

These examples illustrate how modeling the full complexity of a visual scene through 
sampling can be extremely expensive. Fortunately, constructing specialized models, which 
exploit knowledge about the physics of light transport along with the natural coherence of 
real-world objects, can make these problems more tractable. 


14.4.2 The modeling to rendering continuum 


The image-based rendering representations and algorithms we have studied in this chapter 
span a continuum ranging from classic 3D texture-mapped models all the way to pure sam- 
pled ray-based representations such as light fields (Figure 14.15). Representations such as 
view-dependent texture maps and Lumigraphs still use a single global geometric model, but 
select the colors to map onto these surfaces from nearby images. View-dependent geometry, 
e.g., multiple depth maps, sidesteps the need for coherent 3D geometry, and can sometimes 
better model local non-rigid effects such as specular motion (Swaminathan, Kang et al. 2002; 
Criminisi, Kang et al. 2005). Sprites with depth and layered depth images use image-based 
representations of both color and geometry and can be efficiently rendered using warping 
operations rather than 3D geometric rasterization. 

The best choice of representation and rendering algorithm depends on both the quantity 
and quality of the input imagery as well as the intended application. When nearby views are 
being rendered, image-based representations capture more of the visual fidelity of the real 
world because they directly sample its appearance. On the other hand, if only a few input 
images are available or the image-based models need to be manipulated, e.g., to change their 
shape or appearance, more abstract 3D representations such as geometric and local reflection 
models are a better fit. As we continue to capture and manipulate increasingly larger quantities 
of visual data, research into these aspects of image-based modeling and rendering will 
continue to evolve. 


14.5 Video-based rendering 


As multiple images can be used to render new images or interactive experiences, can some- 
thing similar be done with video? In fact, a fair amount of work has been done in the area 
of video-based rendering and video-based animation, two terms first introduced by Schödl, 
Szeliski et al. (2000) to denote the process of generating new video sequences from captured 
video footage. An early example of such work is Video Rewrite (Bregler, Covell, and Slaney 
1997), in which archival video footage is “re-animated” by having actors say new utterances 
(Figure 14.16). More recently, the term video-based rendering has been used by some re- 
searchers to denote the creation of virtual camera moves from a set of synchronized video 
cameras placed in a studio (Magnor 2005). (The terms free-viewpoint video and 3D video are 
also sometimes used: see Section 14.5.4.) 


In this section, we present a number of video-based rendering systems and applications. 
We start with video-based animation (Section 14.5.1), in which video footage is re-arranged 
or modified, e.g., in the capture and re-rendering of facial expressions. A special case of this 
is video textures (Section 14.5.2), in which source video is automatically cut into segments 
and re-looped to create infinitely long video animations. It is also possible to create such 
animations from still pictures or paintings, by segmenting the image into separately moving 
regions and animating them using stochastic motion fields (Section 14.5.3). 


Next, we turn our attention to 3D video (Section 14.5.4), in which multiple synchronized 
video cameras are used to film a scene from different directions. The source video frames can 
then be re-combined using image-based rendering techniques, such as view interpolation, to 
create virtual camera paths between the source cameras as part of a real-time viewing expe- 
rience. Finally, we discuss capturing environments by driving or walking through them with 
panoramic video cameras to create interactive video-based walkthrough experiences (Section 14.5.5). 

Figure 14.16 Video Rewrite (Bregler, Covell, and Slaney 1997) © 1997 ACM: the video 
frames are composed from bits and pieces of old video footage matched to a new audio track. 


14.5.1 Video-based animation 


As we mentioned above, an early example of video-based animation is Video Rewrite, in 
which frames from original video footage are rearranged to match them to novel spoken 
utterances, e.g., for movie dubbing (Figure 14.16). This is similar in spirit to the way that 
concatenative speech synthesis systems work (Taylor 2009). 

In their system, Bregler, Covell, and Slaney (1997) first use speech recognition to ex- 
tract phonemes from both the source video material and the novel audio stream. Phonemes 
are grouped into triphones (triplets of phonemes), as these better model the coarticulation 
effect present when people speak. Matching triphones are then found in the source footage 
and audio track. The mouth images corresponding to the selected video frames are then 
cut and pasted into the desired video footage being re-animated or dubbed, with appropriate 
geometric transformations to account for head motion. During the analysis phase, features 
corresponding to the lips, chin, and head are tracked using computer vision techniques. Dur- 
ing synthesis, image morphing techniques are used to blend and stitch adjacent mouth shapes 
into a more coherent whole. In subsequent work, Ezzat, Geiger, and Poggio (2002) describe 
how to use a multidimensional morphable model (Section 13.6.2) combined with regularized 
trajectory synthesis to improve these results. 
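
The triphone lookup at the heart of this pipeline can be sketched as a simple index over the source footage. This toy version handles exact matches only and represents phonemes as plain strings; both choices, along with the function names, are assumptions for illustration, and a practical system would also need a fallback for triphones that never occur in the source.

```python
from collections import defaultdict

def triphones(phonemes):
    """Return (triphone, index of its middle phoneme) for a phoneme sequence."""
    return [(tuple(phonemes[i - 1:i + 2]), i)
            for i in range(1, len(phonemes) - 1)]

def match_triphones(source, target):
    """For each target triphone, list the source positions with the same triphone."""
    index = defaultdict(list)
    for tri, pos in triphones(source):
        index[tri].append(pos)
    return [(tri, index.get(tri, [])) for tri, _ in triphones(target)]
```

Each returned source position identifies a stretch of video whose mouth images were recorded in the same phonetic context as the new utterance, which is why triphones capture coarticulation better than isolated phonemes.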

A more sophisticated version of this system, called face transfer, uses a novel source 
video, instead of just an audio track, to drive the animation of a previously captured video, 
i.e., to re-render a video of a talking head with the appropriate visual speech, expression, 
and head pose elements (Vlasic, Brand et al. 2005). This work is one of many performance- 
driven animation systems (Section 7.1.6), which are often used to animate 3D facial models 
(Figures 13.23-13.25). While traditional performance-driven animation systems use marker- 
based motion capture (Williams 1990; Litwinowicz and Williams 1994; Ma, Jones et al. 
2008), video footage can now be used directly to control the animation (Buck, Finkelstein 
et al. 2000; Pighin, Szeliski, and Salesin 2002; Zhang, Snavely et al. 2004; Vlasic, Brand et 
al. 2005; Roble and Zafar 2009; Thies, Zollhöfer et al. 2016; Thies, Zollhöfer et al. 2018; 
Zollhöfer, Thies et al. 2018; Fried, Tewari et al. 2019; Egger, Smith et al. 2020; Tewari, Fried 
et al. 2020). More details on related techniques can also be found in Section 13.6.3 on facial 
animation and Section 14.6 on neural rendering. 

In addition to its most common application to facial animation, video-based animation 
can also be applied to whole body motion (Section 13.6.4), e.g., by matching the flow fields 
between two different source videos and using one to drive the other (Efros, Berg et al. 2003; 
Wang, Liu et al. 2018; Chan, Ginosar et al. 2019). Another approach to video-based rendering 
is to use flow or 3D modeling to unwrap surface textures into stabilized images, which can 
then be manipulated and re-rendered onto the original video (Pighin, Szeliski, and Salesin 
2002; Rav-Acha, Kohli et al. 2008). 


14.5.2 Video textures 


Video-based animation is a powerful means of creating photo-realistic videos by re-purposing 
existing video footage to match some other desired activity or script. What if, instead of 
constructing a special animation or narrative, we simply want the video to continue playing 
in a plausible manner? For example, many websites use images or videos to highlight their 
destinations, e.g., to portray attractive beaches with surf and palm trees waving in the wind. 
Instead of using a static image or a video clip that has a discontinuity when it loops, can we 
transform the video clip into an infinite-length animation that plays forever? 

This idea is the basis of video textures, in which a short video clip can be arbitrarily 
extended by re-arranging video frames while preserving visual continuity (Schödl, Szeliski et 
al. 2000). The basic problem in creating video textures is how to perform this re-arrangement 
without introducing visual artifacts. Can you think of how you might do this? 

The simplest approach is to match frames by visual similarity (e.g., $L_2$ distance) and to 
jump between frames that appear similar. Unfortunately, if the motions in the two frames 
are different, a dramatic visual artifact will occur (the video will appear to “stutter”). For 
example, if we fail to match the motions of the clock pendulum in Figure 14.17a, it can 
suddenly change direction in mid-swing. 

How can we extend our basic frame matching to also match motion? In principle, we 
could compute optical flow at each frame and match this. However, flow estimates are often 
unreliable (especially in textureless regions) and it is not clear how to weight the visual and 
motion similarities relative to each other. As an alternative, Schödl, Szeliski et al. (2000) 
suggest matching triplets or larger neighborhoods of adjacent video frames, much in the 
same way as Video Rewrite matches triphones. Once we have constructed an n × n similarity 
matrix between all video frames (where n is the number of frames), a simple finite impulse 
response (FIR) filtering of each match sequence can be used to emphasize subsequences that 
match well. 

Figure 14.17 Video textures (Schödl, Szeliski et al. 2000) © 2000 ACM: (a) a clock pendulum, 
with correctly matched direction of motion; (b) a candle flame, showing temporal 
transition arcs; (c) the flag is generated using morphing at jumps; (d) a bonfire uses 
longer cross-dissolves; (e) a waterfall cross-dissolves several sequences at once; (f) a smiling 
animated face; (g) two swinging children are animated separately; (h) the balloons 
are automatically segmented into separate moving regions; (i) a synthetic fish tank consisting 
of bubbles, plants, and fish. Videos corresponding to these images can be found at 
https://www.cc.gatech.edu/gvu/perception/projects/videotexture. 

The result of this match computation gives us a jump table or, equivalently, a transition 
probability between any two frames in the original video. This is shown schematically as 
red arcs in Figure 14.17b, where the red bar indicates which video frame is currently be- 
ing displayed, and arcs light up as a forward or backward transition is taken. We can view 
these transition probabilities as encoding the hidden Markov model (HMM) that underlies a 
stochastic video generation process. 
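
The whole pipeline, from frame distances through diagonal filtering to a random walk over frames, fits in a short Python sketch. Frames are represented as flat lists of pixel values, the filter taps and the exponential mapping from distance to probability (controlled by `sigma`) are illustrative choices, and the handling of frames without a valid successor is a simple restart rather than anything more careful. All names are hypothetical.

```python
import math
import random

def frame_distances(frames):
    """L2 distance between every pair of frames (flat lists of pixel values)."""
    n = len(frames)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(frames[i], frames[j])))
             for j in range(n)] for i in range(n)]

def filter_diagonals(dist, taps=(0.25, 0.5, 0.25)):
    """FIR-filter along diagonals so that matching subsequences score well."""
    n, half = len(dist), len(taps) // 2
    out = [[math.inf] * n for _ in range(n)]   # borders lack full support
    for i in range(half, n - half):
        for j in range(half, n - half):
            out[i][j] = sum(w * dist[i + k - half][j + k - half]
                            for k, w in enumerate(taps))
    return out

def transition_probabilities(dist, sigma):
    """P[i][j] proportional to exp(-D[i+1][j] / sigma): jump to j if it resembles i+1."""
    n = len(dist)
    probs = []
    for i in range(n - 1):
        row = [math.exp(-dist[i + 1][j] / sigma) for j in range(n)]
        total = sum(row)
        probs.append([p / total for p in row] if total > 0.0 else row)
    return probs

def play(probs, start, length, rng=random):
    """Random walk over frame indices using the transition table."""
    n = len(probs) + 1
    seq, i = [start], start
    for _ in range(length - 1):
        if i >= len(probs) or sum(probs[i]) == 0.0:
            i = (i + 1) % n                    # no valid transition: step on
        else:
            i = rng.choices(range(len(probs[i])), weights=probs[i])[0]
        seq.append(i)
    return seq
```

A small `sigma` restricts playback to the very best transitions, whereas a large `sigma` allows more variety at the cost of more visible jumps.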

Sometimes, it is not possible to find exactly matching subsequences in the original video. 
In this case, morphing, i.e., warping and blending frames during transitions (Section 3.6.3) 
can be used to hide the visual differences (Figure 14.17c). If the motion is chaotic enough, 
as in a bonfire or a waterfall (Figures 14.17d-e), simple blending (extended cross-dissolves) 
may be sufficient. Improved transitions can also be obtained by performing 3D graph cuts on 
the spatio-temporal volume around a transition (Kwatra, Schödl et al. 2003). 

Video textures need not be restricted to chaotic random phenomena such as fire, wind, 
and water. Pleasing video textures can be created of people, e.g., a smiling face (as in Fig- 
ure 14.17f) or someone running on a treadmill (Schödl, Szeliski et al. 2000). When multiple 
people or objects are moving independently, as in Figures 14.17g–h, we must first segment 
the video into independently moving regions and animate each region separately. It is also 
possible to create large panoramic video textures from a slowly panning camera (Agarwala, 
Zheng et al. 2005; He, Liao et al. 2017). 

Instead of just playing back the original frames in a stochastic (random) manner, video 
textures can also be used to create scripted or interactive animations. If we extract individual 
elements, such as fish in a fishtank (Figure 14.17i), into separate video sprites, we can animate 
them along prespecified paths (by matching the path direction with the original sprite motion) 
to make our video elements move in a desired fashion (Schödl and Essa 2002). A more recent 
example of controlling video sprites is the Vid2Player system, which models the movements 
and shots of tennis players to create synthetic video-realistic games (Zhang, Sciutto et al. 
2021). In fact, work on video textures inspired research on systems that re-synthesize new 
motion sequences from motion capture data, which some people refer to as “mocap soup” 
(Arikan and Forsyth 2002; Kovar, Gleicher, and Pighin 2002; Lee, Chai et al. 2002; Li, Wang, 
and Shum 2002; Pullen and Bregler 2002). 

While video textures primarily analyze the video as a sequence of frames (or regions) 
that can be re-arranged in time, temporal textures (Szummer and Picard 1996; Bar-Joseph, 
El-Yaniv et al. 2001) and dynamic textures (Doretto, Chiuso et al. 2003; Yuan, Wen et al. 
2004; Doretto and Soatto 2006) treat the video as a 3D spatio-temporal volume with textural 
properties, which can be described using auto-regressive temporal models and combined with 
layered representations (Chan and Vasconcelos 2009). In more recent work, video texture 
authoring systems have been extended to allow for control over the dynamism (amount of 
motion) in different regions (Joshi, Mehta et al. 2012; Liao, Joshi, and Hoppe 2013; Yan, Liu, 
and Furukawa 2017; He, Liao et al. 2017; Oh, Joo et al. 2017) and improved loop transitions 
(Liao, Finch, and Hoppe 2015). 

Figure 14.18 Animating still pictures (Chuang, Goldman et al. 2005) © 2005 ACM. (a) 
The input still image is manually segmented into (b) several layers. (c) Each layer is then an- 
imated with a different stochastic motion texture. (d) The animated layers are then composited 
to produce (e) the final animation. 


14.5.3 Application: Animating pictures 


While video textures can turn a short video clip into an infinitely long video, can the same 
thing be done with a single still image? The answer is yes, if you are willing to first segment 
the image into different layers and then animate each layer separately. 

Chuang, Goldman et al. (2005) describe how an image can be decomposed into separate 
layers using interactive matting techniques. Each layer is then animated using a class-specific 
synthetic motion. As shown in Figure 14.18, boats rock back and forth, trees sway in the 
wind, clouds move horizontally, and water ripples, using a shaped noise displacement map. 
All of these effects can be tied to some global control parameters, such as the velocity and 
direction of a virtual wind. After being individually animated, the layers can be composited 
to create a final dynamic rendering. 

In more recent work, Holynski, Curless et al. (2021) train a deep network to take a static 
photo, hallucinate a plausible motion field, encode the image as deep multi-resolution 
features, and then advect these features bi-directionally in time using Eulerian motion, using an 
architecture inspired by Niklaus and Liu (2020) and Wiles, Gkioxari et al. (2020). The 
resulting deep features are then decoded to produce a looping video clip with synthetic stochastic 
fluid motions. 


14.5.4 3D and free-viewpoint video 


In the last decade, 3D movies have become an established medium. Currently, such 
releases are filmed using stereoscopic camera rigs and displayed in theaters (or at home) 
to viewers wearing polarized glasses. In the future, however, home audiences may wish to 
view such movies with multi-zone auto-stereoscopic displays, where each person gets his 
or her own customized stereo stream and can move around a scene to see it from different 
perspectives. 

The stereo matching techniques developed in the computer vision community along with 
image-based rendering (view interpolation) techniques from graphics are both essential com- 
ponents in such scenarios, which are sometimes called free-viewpoint video (Carranza, Theobalt 
et al. 2003) or virtual viewpoint video (Zitnick, Kang et al. 2004). In addition to solving a 
series of per-frame reconstruction and view interpolation problems, the depth maps or 
proxies produced by the analysis phase must be temporally consistent in order to avoid flickering 
artifacts. Neural rendering techniques (Tewari, Fried et al. 2020, Section 6.3) can also be 
used for both the reconstruction and rendering phases. 

Shum, Chan, and Kang (2007) and Magnor (2005) present nice overviews of various 
video view interpolation techniques and systems. These include the Virtualized Reality sys- 
tem of Kanade, Rander, and Narayanan (1997) and Vedula, Baker, and Kanade (2005), Im- 
mersive Video (Moezzi, Katkere et al. 1996), Image-Based Visual Hulls (Matusik, Buehler 
et al. 2000; Matusik, Buehler, and McMillan 2001), and Free-Viewpoint Video (Carranza, 
Theobalt et al. 2003), which all use global 3D geometric models (surface-based (Section 13.3) 
or volumetric (Section 13.5)) as their proxies for rendering. The work of Vedula, Baker, and 
Kanade (2005) also computes scene flow, i.e., the 3D motion between corresponding surface 
elements, which can then be used to perform spatio-temporal interpolation of the multi-view 
video stream. A more recent variant of scene flow is the occupancy flow work of Niemeyer, 
Mescheder et al. (2019). 

The Virtual Viewpoint Video system of Zitnick, Kang et al. (2004), on the other hand, 
associates a two-layer depth map with each input image, which allows them to accurately 
model occlusion effects such as the mixed pixels that occur at object boundaries. Their sys- 
tem, which consists of eight synchronized video cameras connected to a disk array (Fig- 
ure 14.19a), first uses segmentation-based stereo to extract a depth map for each input image 
(Figure 14.19e). Near object boundaries (depth discontinuities), the background layer is ex- 
tended along a strip behind the foreground object (Figure 14.19c) and its color is estimated 
from the neighboring images where it is not occluded (Figure 14.19d). Automated matting 
techniques (Section 10.4) are then used to estimate the fractional opacity and color of bound- 
ary pixels in the foreground layer (Figure 14.19f). 

Figure 14.19 Video view interpolation (Zitnick, Kang et al. 2004) © 2004 ACM: (a) the 
capture hardware consists of eight synchronized cameras; (b) the background and foreground 
images from each camera are rendered and composited before blending; (c) the two-layer 
representation, before and after boundary matting; (d) background color estimates; (e) back- 
ground depth estimates; (f) foreground color estimates. 

At render time, given a new virtual camera that lies between two of the original cameras, 
the layers in the neighboring cameras are rendered as texture-mapped triangles and the fore- 
ground layer (which may have fractional opacities) is then composited over the background 
layer (Figure 14.19b). The resulting two images are merged and blended by comparing their 
respective z-buffer values. (Whenever the two z-values are sufficiently close, a linear blend of 
the two colors is computed.) The interactive rendering system runs in real time using regular 
graphics hardware. It can therefore be used to change the observer’s viewpoint while playing 
the video or to freeze the scene and explore it in 3D. Rogmans, Lu et al. (2009) subsequently 
developed GPU implementations of both real-time stereo matching and real-time rendering 
algorithms, which enable them to explore algorithmic alternatives in a real-time setting. 
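
The z-buffer merge described above can be illustrated with a small per-pixel sketch. It assumes that the two neighboring views have already been rendered into the virtual camera as lists of RGB tuples with matching depth lists, that smaller z means closer to the camera, and that the blend weight `w_a` and closeness threshold `eps` are hypothetical parameters chosen by the caller (the text only states that a linear blend is used when the z-values are sufficiently close).

```python
def merge_views(color_a, depth_a, color_b, depth_b, w_a=0.5, eps=0.05):
    """Merge two renderings of the same virtual view, pixel by pixel.

    color_*: lists of (r, g, b) tuples; depth_*: lists of z-values (smaller = closer).
    Where the depths agree to within eps, the colors are linearly blended;
    otherwise the closer surface wins, as in an ordinary z-buffer test.
    """
    merged = []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if abs(za - zb) <= eps:
            merged.append(tuple(w_a * x + (1.0 - w_a) * y for x, y in zip(ca, cb)))
        elif za < zb:
            merged.append(ca)
        else:
            merged.append(cb)
    return merged
```

In practice, `w_a` would plausibly depend on how close the virtual camera is to each source camera, so that the result varies smoothly as the viewpoint moves between them.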

The depth maps computed from the eight stereo cameras using off-line stereo matching 
have been used in studies of 3D video compression (Smolic and Kauff 2005; Gotchev and 
Rosenhahn 2009; Tech, Chen et al. 2015). Active video-rate depth sensing cameras, such as 
the 3DV Zcam (Iddan and Yahav 2001), which we discussed in Section 13.2.1, are another 
potential source of such data. 


When large numbers of closely spaced cameras are available, as in the Stanford Light 
Field Camera (Wilburn, Joshi et al. 2005), it may not always be necessary to compute explicit 
depth maps to create video-based rendering effects, although the results are usually of higher 
quality if you do (Vaish, Szeliski et al. 2006). 


The last few years have seen a revival of research into 3D video, spurred in part by the 
wider availability of virtual reality headsets, which can be used to view such videos with a 
strong sense of immersion. The Jump virtual reality capture system from Google (Anderson, 
Gallup et al. 2016) uses 16 GoPro cameras arranged on a 28cm diameter ring to capture 
multiple videos, which are then stitched offline into a pair of omnidirectional stereo (ODS) 
videos (Ishiguro, Yamamoto, and Tsuji 1992; Peleg, Ben-Ezra, and Pritch 2001; Richardt, 
Pritch et al. 2013), which can then be warped at viewing time to produce separate images for 
each eye. A similar system, constructed from tightly synchronized industrial vision cameras, 
was introduced around the same time by Cabral (2016). 


As noted by Anderson, Gallup et al. (2016), however, the ODS representation has severe 
limitations in interactive viewing, e.g., it does not support head tilt, or translational motion, or 
produce correct depth when looking up or down. More recent systems developed by Serrano, 
Kim et al. (2019), Parra Pozo, Toksvig et al. (2019), and Broxton, Flynn et al. (2020) support 
full 6DoF (six degrees of freedom) video, which allows viewers to move within a bounded 
volume while producing perspectively correct images for each eye. However, they require 
multi-view stereo matching during the offline construction phase to produce the 3D proxies 
needed to support such viewing. 


While these systems are designed to capture inside out experiences, where a user can 
watch a video unfolding all around them, pointing the cameras outside in can be used to 
capture one or more actors performing an activity (Kanade, Rander, and Narayanan 1997; 
Joo, Liu et al. 2015; Tang, Dou et al. 2018). Such setups are often called free-viewpoint video 
or volumetric performance capture systems. The most recent versions of such systems use 
deep networks to reconstruct, represent, compress, and/or render time-evolving volumetric 
scenes (Martin-Brualla, Pandey et al. 2018; Pandey, Tkach et al. 2019; Lombardi, Simon et 
al. 2019; Tang, Singh et al. 2020; Peng, Zhang et al. 2021), as summarized in the recent 
survey on neural rendering by Tewari, Fried et al. (2020, Section 6.3). And while most 
of these systems require custom-built multi-camera rigs, it is also possible to construct 3D 
videos from collections of handheld videos (Bansal, Vo et al. 2020) or even a single moving 
smartphone camera (Yoon, Kim et al. 2020; Luo, Huang et al. 2020). 


896 Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


14.5.5 Application: Video-based walkthroughs 


Video camera arrays enable the simultaneous capture of 3D dynamic scenes from multiple 
viewpoints, which can then enable the viewer to explore the scene from viewpoints near the 
original capture locations. What if, instead, we wish to capture an extended area, such as a 
home, a movie set, or even an entire city? 

In this case, it makes more sense to move the camera through the environment and play 
back the video as an interactive video-based walkthrough. To allow the viewer to look around 
in all directions, it is preferable to use a panoramic video camera (Uyttendaele, Criminisi et 
al. 2004).¹³ 

One way to structure the acquisition process is to capture these images in a 2D horizontal 
plane, e.g., over a grid superimposed inside a room. The resulting sea of images (Aliaga, 
Funkhouser et al. 2003) can be used to enable continuous motion between the captured 
locations.¹⁴ However, extending this idea to larger settings, e.g., beyond a single room, can 
become tedious and data-intensive. 

Instead, a natural way to explore a space is often to just walk through it along some 
prespecified paths, just as museums or home tours guide users along a particular path, say 
down the middle of each room.¹⁵ Similarly, city-level exploration can be achieved by driving 
down the middle of each street and allowing the user to branch at each intersection. This idea 
dates back to the Aspen MovieMap project (Lippman 1980), which recorded analog video 
taken from moving cars onto videodiscs for later interactive playback. 

Improvements in video technology enabled the capture of panoramic (spherical) video 
using a small co-located array of cameras, such as the Point Grey Ladybug camera (Fig- 
ure 14.20b) developed by Uyttendaele, Criminisi et al. (2004) for their interactive video-based 
walkthrough project. In their system, the synchronized video streams from the six cameras 
(Figure 14.20a) are stitched together into 360° panoramas using a variety of techniques de- 
veloped specifically for this project. 

Because the cameras do not share the same center of projection, parallax between the 
cameras can lead to ghosting in the overlapping fields of view (Figure 14.20c). To remove 
this, a multi-perspective plane sweep stereo algorithm is used to estimate per-pixel depths at 
each column in the overlap area. To calibrate the cameras relative to each other, the camera 
is spun in place and a constrained structure from motion algorithm (Figure 11.15) is used to 
estimate the relative camera poses and intrinsics. Feature tracking is then run on the walk- 
through video to stabilize the video sequence. Liu, Gleicher et al. (2009), Kopf, Cohen, and 
Szeliski (2014), and Kopf (2016) have carried out more recent work along these lines. 

13 See https://www.cis.upenn.edu/~kostas/omni.html for descriptions of panoramic (omnidirectional) vision sys- 
tems and associated workshops. 

14 The Photo Tourism system of Snavely, Seitz, and Szeliski (2006) applies this idea to less structured collections. 

15 In computer games, restricting a player to forward and backward motion along predetermined paths is called 
rail-based gaming. 

Figure 14.20 Video-based walkthroughs (Uyttendaele, Criminisi et al. 2004) © 2004 IEEE: 
(a) system diagram of video pre-processing; (b) the Point Grey Ladybug camera; (c) ghost 
removal using multi-perspective plane sweep; (d) point tracking, used both for calibration 
and stabilization; (e) interactive garden walkthrough with map below; (f) overhead map 
authoring and sound placement; (g) interactive home walkthrough with navigation bar (top) 
and icons of interest (bottom). 

Indoor environments with windows, as well as sunny outdoor environments with strong 
shadows, often have a dynamic range that exceeds the capabilities of video sensors. For this 
reason, the Ladybug camera has a programmable exposure capability that enables the brack- 
eting of exposures at subsequent video frames. To merge the resulting video frames into high 
dynamic range (HDR) video, pixels from adjacent frames need to be motion-compensated 
before being merged (Kang, Uyttendaele et al. 2003). 

The interactive walk-through experience becomes much richer and more navigable if an 
overview map is available as part of the experience. In Figure 14.20f, the map has annotations, 
which can show up during the tour, and localized sound sources, which play (with different 
volumes) when the viewer is nearby. The process of aligning the video sequence with the 
map can be automated using a process called map correlation (Levin and Szeliski 2004). 

All of these elements combine to provide the user with a rich, interactive, and immersive 
experience. Figure 14.20e shows a walk through the Bellevue Botanical Gardens, with an 
overview map in perspective below the live video window. Arrows on the ground are used to 
indicate potential directions of travel. The viewer simply orients their view towards one of 
the arrows (the experience can be driven using a game controller) and “walks” forward along 
the desired path. 

Figure 14.20g shows an indoor home tour experience. In addition to a schematic map 
in the lower left corner and adjacent room names along the top navigation bar, icons appear 
along the bottom whenever items of interest, such as a homeowner’s art pieces, are visible 
in the main window. These icons can then be clicked to provide more information and 3D 
views. 

The development of interactive video tours spurred a renewed interest in 360° video-based 
virtual travel and mapping experiences, as evidenced by commercial sites such as Google’s 
Street View and 360cities. The same videos can also be used to generate turn-by-turn driving 
directions, taking advantage of both expanded fields of view and image-based rendering to 
enhance the experience (Chen, Neubert et al. 2009). 

While initially, 360° cameras were exotic and expensive, they have more recently be- 
come widely available consumer products, such as the popular RICOH THETA camera, first 
introduced in 2013, and the GoPro MAX action camera. When shooting 360° videos, it is 
possible to stabilize the video using algorithms tailored to such videos (Kopf 2016) or pro- 
prietary algorithms based on the camera’s IMU readings. And while most of these cameras 
produce monocular photos and videos, VR180 cameras have two lenses and so can create 
wide field-of-view stereoscopic content. It is even possible to produce 3D 360° content by 
carefully stitching and transforming two 360° camera streams (Matzen, Cohen et al. 2017). 

Figure 14.21 First-person hyperlapse video creation (Kopf, Cohen, and Szeliski 2014) © 
2014 ACM: (a) 3D camera path and point cloud recovery, followed by smooth path planning; 
(b) 3D per-camera proxy estimation; and (c) source frame and seam selection using an MRF 
and Poisson blending. 

In addition to capturing immersive photos and videos of scenic locations and popular 
events, 360° and regular action cameras can also be worn, moved through an environment, 
and then sped up to create hyperlapse videos (Kopf, Cohen, and Szeliski 2014). Because such 
videos may exhibit large amounts of translational motion and parallax when heavily sped up, 
it is insufficient to simply compensate for camera rotations or even to warp individual input 
frames, because the large amounts of compensating motion may force the virtual camera to 
look outside the video frames. Instead, after constructing a sparse 3D model and smoothing 
the camera path, keyframes are selected and 3D proxies are computed for each of these by 
interpolating the sparse 3D point cloud, as shown in Figure 14.21. These frames are then 
warped and stitched together (using Poisson blending) using a Markov random field to ensure 
as much smoothness and visual continuity as possible. This system combines many different 
previously developed 3D modeling, computational photography, and image-based rendering 
algorithms to produce remarkably smooth high-speed tours of large-scale environments (such 
as cities) and activities (such as rock climbing and skiing). 

As we continue to capture more and more of our real world with large amounts of high- 
quality imagery and video, the interactive modeling, exploration, and rendering techniques 
described in this chapter will play an even bigger role in bringing virtual experiences based 
in remote areas of the world as well as re-living special memories closer to everyone. 


14.6 Neural rendering 


The most recent development in image-based rendering is the introduction of deep neural 
networks into both the modeling (construction) and viewing parts of image-based rendering 
pipelines. Neural rendering has been applied in a number of different domains, including 
style and texture manipulation and 2D semantic photo synthesis (Sections 5.5.4 and 10.5.3), 
3D object shape and appearance modeling (Section 13.5.1), facial animation and reenact- 
ment (Section 13.6.3), 3D body capture and replay (Section 13.6.4), novel view synthesis 
(Section 14.1), free-viewpoint video (Section 14.5.4), and relighting (Duchêne, Riant et al. 
2015; Meka, Haene et al. 2019; Philip, Gharbi et al. 2019; Sun, Barron et al. 2019; Zhou, 
Hadap et al. 2019; Zhang, Barron et al. 2020). 

A comprehensive survey of all of these applications and techniques can be found in the 
state of the art report by Tewari, Fried et al. (2020), whose abstract states: 


Neural rendering is a new and rapidly emerging field that combines genera- 
tive machine learning techniques with physical knowledge from computer graph- 
ics, e.g., by the integration of differentiable rendering into network training. With 
a plethora of applications in computer graphics and vision, neural rendering is 


poised to become a new area in the graphics community... 


The survey contains over 230 references and highlights 46 representative papers, grouped into 
six general categories, namely semantic photo synthesis, novel view synthesis, free viewpoint 
video, relighting, facial reenactment, and body reenactment. As you can tell, these categories 
overlap with the sections of the book mentioned in the previous paragraph. A set of lectures 
based on this content can be found in the related CVPR tutorial on neural rendering (Tewari, 
Zollhöfer et al. 2020), and several of the lectures in the TUM AI Guest Lecture Series are 
also on neural rendering research.¹⁶ The X-Fields paper by Bemana, Myszkowski et al. 
(2020, Table 1) also has a nice tabulation of related space, time, and illumination interpolation 
papers with an emphasis on deep methods, while the short bibliography by Dellaert and Yen- 
Chen (2021) summarizes even more recent techniques. Some neural rendering systems are 
implemented using differentiable rendering, which is surveyed by Kato, Beker et al. (2020). 
As we have already seen many of these neural rendering techniques in the previous sec- 
tions mentioned above, we focus here on their application to 3D image-based modeling and 
rendering. There are many ways to organize the last few years’ worth of research in neural 
rendering. In this section, I have chosen to use four broad categories of underlying 3D repre- 
sentations, which we have studied in the last two chapters, namely: texture-mapped meshes, 
depth images and layers, volumetric grids, and implicit functions. 


Texture-mapped meshes. As described in Chapter 13, a convenient representation for 
modeling and rendering a 3D scene is a triangle mesh, which can be reconstructed from 
images using multi-view stereo. One of the earliest papers to use a neural network as part of 
the 3D rendering process was the deep blending system of Hedman, Philip et al. (2018), who 
augment an unstructured Lumigraph rendering pipeline (Buehler, Bosse et al. 2001) with a 
deep neural network that computes the per-pixel blending weights for the warped images se- 
lected for each novel view, as shown in Figure 14.22a. LookinGood (Martin-Brualla, Pandey 
et al. 2018) takes a single or multiple-image texture-mapped 3D upper or whole-body render- 
ing and fills in the holes, denoises the appearance, and increases the resolution using a U-Net 
trained on held out views. Along a similar line, Deep Learning Super Sampling (DLSS) 
uses an encoder-decoder DNN implemented in GPU hardware to increase the resolution of 
rendered games in real time (Burnes 2020). 

16 https://niessner.github.io/TUM-AI-Lecture-Series 

Figure 14.22 Examples of neural image-based rendering: (a) deep blending of depth- 
warped source images (Hedman, Philip et al. 2018) © 2018 ACM; (b) neural re-rendering in 
the wild with controllable view and lighting (Meshry, Goldman et al. 2019) © 2019 IEEE; (c) 
crowdsampling the plenoptic function with a deep MPI (Li, Xian et al. 2020) © 2020 Springer; 
(d) SynSin: novel view synthesis from a single image (Wiles, Gkioxari et al. 2020) © 2020 
IEEE. 

While these systems warp colored textures or images (i.e., view-dependent textures) and 
then apply a neural net post-process, it is also possible to first convert the images into a “neu- 
ral” encoding and then warp and blend such representations. Free View Synthesis (Riegler 
and Koltun 2020a) starts by building a local 3D model for the novel view using multi-view 
stereo. It then encodes the source images as neural codes, reprojects these codes to the novel 
viewpoint, and composites them using a recurrent neural network and softmax. Instead of 
warping neural codes at render time and then blending and decoding them, the follow-on 
Stable View Synthesis system (Riegler and Koltun 2020b) collects neural codes from all in- 
coming rays for every surface point and then combines these with an on-surface aggregation 
network to produce outgoing neural codes along the rays to the novel view camera. Deferred 
Neural Rendering (Thies, Zollhöfer, and Nießner 2019) uses a (u, v) parameterization over 
the 3D surface to learn and store a 2D texture map of neural codes, which can be sampled 
and decoded at rendering time. 


Depth images and layers. To deal with images taken at different times of day and weather, 
i.e., “in the wild”, Meshry, Goldman et al. (2019) use a DNN to compute a latent “appear- 
ance” vector for each input image and its associated depth image (computed using traditional 
multi-view stereo), as shown in Figure 14.22b. At render time, the appearance can be manip- 
ulated (in addition to the 3D viewpoint) to explore the range of conditions under which the 
images were taken. Li, Xian et al. (2020) develop a related pipeline (Figure 14.22c), which 
instead of storing a single “deep” color/depth/appearance image or buffer uses a multiplane 
image (MPI). As with the previous system, an encoder-decoder modulated with the appear- 
ance vector (using Adaptive Instance Normalization) is used to render the final image, in this 
case through an intermediate MPI that does the view warping and over compositing. Instead 
of using many parallel finely sliced planes, the GeLaTO (Generative Latent Textured Objects) 
system uses a small number of oriented planes (“billboards”) with associated neural textures 
to model thin transparent objects such as eyeglasses (Martin-Brualla, Pandey et al. 2020). At 
render time, these textures are warped and then decoded and composited using a U-Net to 
produce a final RGBA sprite. 
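
The over compositing step shared by multiplane images and layered billboards can be sketched as follows. This is a minimal illustration, assuming that each plane has already been warped into the novel view, that planes are supplied farthest first, and that colors are stored non-premultiplied as (r, g, b, a) tuples; none of these conventions are taken from the papers cited above.

```python
def over(front_rgba, back_rgb):
    """Composite one non-premultiplied RGBA sample over an opaque RGB color."""
    r, g, b, a = front_rgba
    return tuple(a * f + (1.0 - a) * k for f, k in zip((r, g, b), back_rgb))


def composite_mpi(planes):
    """Back-to-front over compositing of warped planes.

    planes: list of images ordered farthest first; each image is a list of rows
    of (r, g, b, a) tuples, and all images share the same size.
    Returns an image of (r, g, b) tuples over a black background.
    """
    height, width = len(planes[0]), len(planes[0][0])
    out = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    for plane in planes:
        out = [[over(plane[y][x], out[y][x]) for x in range(width)]
               for y in range(height)]
    return out
```

Because every operation is a weighted sum, the same compositing rule is differentiable with respect to both colors and opacities, which is what allows such layered representations to be trained end to end.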

While all of these previous systems use multiple images to build a 3D neural represen- 
tation, SynSin (Synthesis from a Single Image) (Wiles, Gkioxari et al. 2020) starts with just 
a single color image and uses a DNN to turn this image into a neural feature F and depth 
D buffer pair, as shown in Figure 14.22d. At render time, the neural features are warped 
according to their associated depths and the camera view matrix, splatted with soft weights, 
and composited back-to-front to obtain a neural rendered frame F, which is then decoded 
into the final color novel view I_G. In Semantic View Synthesis, Huang, Tseng et al. (2020) 
start with a semantic label map and use semantic image synthesis (Section 5.5.4) to convert 
this into a synthetic color image and depth map. These are then used to create a multiplane 
image from which novel views can be rendered. Holynski, Curless et al. (2021) train a deep 
network to take a static photo, hallucinate a plausible motion field, encode the image as deep 
features with soft blending weights, advect these features bi-directionally in time, and decode 
the rendered neural feature frames to produce a looping video clip with synthetic stochastic 
fluid motions, as discussed in Section 14.5.3. 


Voxel representations. Another 3D representation that can be used for neural rendering is 
a 3D voxel grid. Figure 14.23 shows the modeling and rendering pipelines from two such 
papers. DeepVoxels (Sitzmann, Thies et al. 2019) learn a 3D embedding of neural codes 
for a given 3D object. At render time, these are projected into a 2D view, filtered through an 
occlusion network (similar to back-to-front alpha compositing), and then decoded into a final 
image. Neural Volumes (Lombardi, Simon et al. 2019) use an encoder-decoder to convert a 
set of multi-view color images into a 3D RGBα volume and an associated volumetric warp 
field that can model facial expression variation. At render time, the color volume is warped 
and then ray marching is used to create a final 2D RGBα foreground image.¹⁷ In more recent 
work, Weng, Curless, and Kemelmacher-Shlizerman (2020) show how deformable Neural 
Volumes can be constructed and animated from monocular videos of moving people, such as 
athletes. 
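
Ray marching through a color-plus-opacity voxel grid can be sketched as a front-to-back accumulation along each ray. The code below is a generic illustration rather than the procedure of any of the papers above: it assumes a volume indexed as `volume[z][y][x]` holding (r, g, b, alpha) tuples, nearest-neighbor sampling, a fixed step size, and early termination once the ray is nearly opaque.

```python
def sample_volume(volume, p):
    """Nearest-neighbor lookup of (r, g, b, alpha) at a continuous point p = (x, y, z)."""
    x, y, z = (int(round(c)) for c in p)
    if 0 <= z < len(volume) and 0 <= y < len(volume[0]) and 0 <= x < len(volume[0][0]):
        return volume[z][y][x]
    return (0.0, 0.0, 0.0, 0.0)  # empty space outside the grid


def march_ray(volume, origin, direction, num_steps, step=1.0):
    """Accumulate color front to back along one ray.

    Returns ((r, g, b), alpha), where the color is premultiplied by the
    accumulated opacity alpha.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for k in range(num_steps):
        p = tuple(o + k * step * d for o, d in zip(origin, direction))
        r, g, b, alpha = sample_volume(volume, p)
        weight = transmittance * alpha
        color[0] += weight * r
        color[1] += weight * g
        color[2] += weight * b
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # nearly opaque: later samples are invisible
            break
    return tuple(color), 1.0 - transmittance
```

Marching front to back gives the same result as back-to-front over compositing of the same samples, but allows the loop to stop early once the accumulated opacity saturates.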


Coordinate-based neural representations. The final representations we discuss in this sec- 
tion are implicit functions implemented using fully connected networks, which are now more 
commonly known as multilayer perceptrons or MLPs.¹⁸ We have already seen the use of [0, 1] 
occupancy and implicit signed distance functions for 3D shape modeling in Section 13.5.1, 
where we mentioned papers such as Occupancy Networks (Mescheder, Oechsle et al. 2019), 
IM-NET (Chen and Zhang 2019), DeepSDF (Park, Florence et al. 2019), and Convolutional 
Occupancy Networks (Peng, Niemeyer et al. 2020). 

17 Note that we mostly use RGBA in earlier parts of the book to denote three color channels with an opacity. In 
the remainder of this section, I use RGBα to be consistent with recent papers. 

Figure 14.23 Examples of voxel grid neural rendering: (a) DeepVoxels (Sitzmann, Thies 
et al. 2019) © 2019 IEEE; (b) Neural Volumes (Lombardi, Simon et al. 2019) © 2019 ACM. 
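
To make the distinction between occupancy and signed distance representations concrete, the sketch below uses an analytic sphere as a stand-in for a trained coordinate-based network (an assumption made purely for illustration; in the systems cited above, the function would be an MLP). The surface is the zero level set of the signed distance function, occupancy is obtained by thresholding it, and a point on the surface can be located by bisection between an inside and an outside point.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(p, center) - radius


def occupancy(p, sdf=sphere_sdf):
    """Binary occupancy derived from a signed distance function."""
    return 1.0 if sdf(p) <= 0.0 else 0.0


def find_surface(inside, outside, sdf=sphere_sdf, iterations=40):
    """Bisect the segment from an inside point to an outside point to find the zero level set."""
    a, b = inside, outside
    for _ in range(iterations):
        mid = tuple(0.5 * (x + y) for x, y in zip(a, b))
        if sdf(mid) <= 0.0:
            a = mid  # midpoint is inside (or on) the surface
        else:
            b = mid  # midpoint is outside
    return tuple(0.5 * (x + y) for x, y in zip(a, b))
```

Replacing `sphere_sdf` with any other callable that maps a 3D point to a signed distance leaves the rest of the code unchanged, which is the appeal of coordinate-based representations.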

To render colored images, such representations also need to encode the appearance (e.g., 
color, texture, or light field) information at either the surface or throughout the volume. Tex- 
ture Fields (Oechsle, Mescheder et al. 2019) train an MLP conditioned on both 3D shape and 
latent appearance (e.g., car color) to produce a 3D volumetric color field that can then be used 
to texture-map a 3D model, as shown in Figure 14.24a. This representation can be extended 
using differentiable rendering to directly compute depth gradients, as in Differentiable 
Volumetric Rendering (DVR) (Niemeyer, Mescheder et al. 2020). Pixel-aligned Implicit function 
(PIFu) networks (Saito, Huang et al. 2019; Saito, Simon et al. 2020) also use MLPs to com- 


18 As Jon Barron and others have pointed out, only signed distance functions actually encode “implicit functions” 
as level-sets of their volumetric values. The more general class of techniques that includes opacity models is often 
called coordinate regression networks or coordinate-based MLPs. 
Figure 14.24 Examples of implicit function (MLP) neural rendering: (a) Texture Fields 
(Oechsle, Mescheder et al. 2019) © 2019 IEEE; (b) Neural Radiance Fields (Mildenhall, 
Srinivasan et al. 2020) © 2020 Springer. 


pute volumetric inside/outside and color fields and can hallucinate full 3D models from just 
a single color image, as shown in Figure 13.18. Scene representation networks (Sitzmann, 
Zollhöfer, and Wetzstein 2019) use an MLP to map volumetric (x, y, z) coordinates to high- 
dimensional neural features, which are used by both a ray marching LSTM (conditioned on 
the 3D view and output pixel coordinate) and a 1 × 1 color pixel decoder to generate the final 
image. The network can interpolate both appearance and shape latent variables. 


An interesting hybrid system that replaces a trained per-object MLP with on-the-fly multi- 
view stereo matching and image-based rendering is the IBRNet system of Wang, Wang et al. 
(2021). As with other volumetric neural renderers, the network evaluates each ray in the novel 
viewpoint image by marching along the ray and computing a density and neural appearance 
feature at each sampled location. However, instead of looking up these values from a pre- 
trained MLP, it samples the neural features from a small number of adjacent input images, 
much like in Unstructured Lumigraph (Buehler, Bosse et al. 2001; Hedman, Philip et al. 
2018) and Stable View Synthesis (Riegler and Koltun 2020b), which use a precomputed 3D 
surface model (which IBRNet does not). The opacity and appearance values along the ray are 
refined using a transformer architecture, which replaces the more traditional winner-take-all 
module in a stereo matcher, followed by a classic volumetric compositing of the colors and 


densities. 
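
The classic volumetric compositing step mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the IBRNet implementation: the function name `composite_ray`, the conversion of density to opacity via `1 - exp(-sigma * delta)`, and the plain-list data layout are all assumptions chosen for clarity.

```python
import math

def composite_ray(densities, colors, deltas):
    """Front-to-back compositing of samples along one ray.

    densities: non-negative volume densities (sigma) at each sample
    colors:    (r, g, b) tuples at each sample
    deltas:    distances between adjacent samples
    Returns the accumulated color and the total opacity of the ray.
    """
    transmittance = 1.0  # fraction of light that has not yet been absorbed
    out = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for c in range(3):
            out[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return tuple(out), 1.0 - transmittance
```

Because each sample's weight depends only on the densities in front of it, the whole operation is differentiable with respect to both densities and colors, which is what allows such renderers to be trained from images.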


To model viewpoint dependent effects such as highlights on plastic objects, i.e., to model 
a full light field (Section 14.3), Neural Radiance Fields (NeRF) extend the implicit mapping 
from (x, y, z) spatial positions to also include a viewing direction (θ, φ) as input, as shown 
in Figure 14.24b (Mildenhall, Srinivasan et al. 2020). Each (x, y, z) query is first turned into 
a positional encoding that consists of sinusoidal waves at octave frequencies before going 
into a 256-channel MLP. These positional codes are also injected into the fifth layer, and an 
encoding of the viewing direction is injected at the ninth layer, which is where the opacities 
are computed (Mildenhall, Srinivasan et al. 2020, Figure 7). It turns out that these positional 
encodings are essential to enabling the MLP to represent fine details, as explored in more 
depth by Tancik, Srinivasan et al. (2020), as well as in the SIREN (Sinusoidal Representation 
Network) paper by Sitzmann, Martel et al. (2020), which uses periodic (sinusoidal) activation 


functions. 
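
As a minimal sketch of the positional encoding described above, the following Python function maps each input coordinate to sines and cosines at octave-spaced frequencies. The function name, the default of 10 frequency bands, and the factor of π are assumptions based on common practice rather than details given in this text.

```python
import math

def positional_encoding(point, num_bands=10):
    """Encode each coordinate with sin/cos at frequencies 2^k * pi, k = 0..num_bands-1."""
    encoded = []
    for value in point:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi  # each band doubles the frequency (one octave)
            encoded.append(math.sin(freq * value))
            encoded.append(math.cos(freq * value))
    return encoded
```

With these defaults, a 3D position becomes a 60-dimensional vector, which gives the MLP direct access to high-frequency variations of the input that it would otherwise struggle to represent.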


It is also possible to pre-train these neural networks, i.e., use meta-learning, on a wider 
class of objects to speed up the optimization task for new images (Sitzmann, Chan et al. 
2020; Tancik, Mildenhall et al. 2021) and also to use cone tracing together with integrated 
positional encoding to reduce aliasing and handle multi-resolution inputs and outputs 
(Barron, Mildenhall et al. 2021). The NeRF++ paper by Zhang, Riegler et al. (2020) extends 
the original NeRF representation to handle unbounded 3D scenes by adding an “inside-out” 
1/r inverted sphere parameterization, while Neural Sparse Voxel Fields build an octree with 


implicit neural functions inside each non-empty cell (Liu, Gu et al. 2020). 


Instead of modeling opacities, the Implicit Differentiable Renderer (IDR) developed by 
Yariv, Kasten et al. (2020) models a signed distance function, which enables them at rendering 
time to extract a level-set surface with analytic normals, which are then passed to the neural 
renderer, which models viewpoint-dependent effects. The system also automatically adjusts 
input camera positions using differentiable rendering. Neural Lumigraph Rendering uses si- 
nusoidal representation networks to produce more compact representations (Kellnhofer, Jebe 
et al. 2021). They can also export a 3D mesh for much faster view-dependent Lumigraph ren- 
dering. Takikawa, Litalien et al. (2021) also construct an implicit signed distance field, but 
instead of using a single MLP, they build a sparse octree structure that stores neural features 


in cells (much like neural sparse voxel fields) and supports both level of detail and fast sphere 
tracing. Neural Implicit Surfaces (NeuS) also use a signed distance representation but use a 
rendering formula that better handles surface occlusions (Wang, Liu et al. 2021). 
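
To make the appeal of signed distance representations concrete, here is a minimal sphere tracing sketch in Python. It is not taken from any of the papers above: the analytic `sphere_sdf` stands in for a learned distance field, and the step limit, tolerance, and maximum ray length are illustrative assumptions.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from point p to a sphere (negative inside)."""
    return math.dist(p, center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-6, t_max=100.0):
    """March along a unit-length ray, stepping by the SDF value each time.

    Returns the distance t to the surface, or None if nothing is hit.
    """
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t  # close enough to the zero level set
        t += dist     # safe step: no surface is closer than dist
        if t > t_max:
            return None
    return None
```

Because the signed distance bounds how far the ray can travel without crossing the surface, far fewer network evaluations are needed per ray than with fixed-step ray marching through a density field.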


While NeRF, IDR, and NSVF require a large number of images of a static object taken 
under controlled (uniform lighting) conditions, NeRF in the Wild (Martin-Brualla, Radwan 
et al. 2021) takes an unstructured set of images from a landmark tourist location and not 
only models appearance changes such as weather and time of day but also removes transient 
occluders such as tourists. NeRFs can also be constructed from a single or small number 
of images by conditioning a class-specific neural radiance field on such inputs as in pixel- 
NeRF (Yu, Ye et al. 2020). Deformable neural radiance fields or “nerfies” (Park, Sinha et 
al. 2020), Neural Scene Flow Fields (Li, Niklaus et al. 2021), Dynamic Neural Radiance 
Fields (Pumarola, Corona et al. 2021), Space-time Neural Irradiance Fields (Xian, Huang et 
al. 2021), and HyperNeRF (Park, Sinha et al. 2021) all take as input hand-held videos taken 
around a person or moving through a scene. They model both the viewpoint variation and 
volumetric non-rigid deformations such as head or body movements and expression changes, 
either using a learned deformation field, adding time as an extra input variable, or embedding 
the representation in a higher dimension. 


It is also possible to extend NeRFs to model not only the opacities and view-dependent 
colors of 3D coordinates, but also their interactions with potential illuminants. Neural Re- 
flectance and Visibility Fields (NeRV) do this by also returning for each query 3D coordinate 
a surface normal and parametric BRDF as well as the environment visibility and expected 
termination depth for outgoing rays at that point (Srinivasan, Deng et al. 2021). Neural 
Reflectance Decomposition (NeRD) models densities and colors using an implicit MLP that also 
returns an appearance vector, which is decoded into a parametric BRDF (Boss, Braun et al. 
2020). It then uses the environmental illumination, approximated using spherical Gaussians, 
along with the density normal and BRDF, to render the final color sample at that voxel. PhySG 
uses a similar approach, using a signed distance field to represent the shape and a mixture of 
spherical Gaussians to represent the BRDF (Zhang, Luan et al. 2021). 
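
For readers unfamiliar with spherical Gaussians, the following Python sketch evaluates a single lobe and a mixture of lobes for a given unit direction. The parameterization used here (lobe axis, sharpness, amplitude) is the commonly used form and is an assumption; the exact formulations in NeRD and PhySG may differ in their details.

```python
import math

def spherical_gaussian(direction, lobe_axis, sharpness, amplitude):
    """Evaluate amplitude * exp(sharpness * (dot(direction, lobe_axis) - 1)).

    Both direction and lobe_axis are assumed to be unit-length 3-vectors.
    """
    cos_angle = sum(d * a for d, a in zip(direction, lobe_axis))
    return amplitude * math.exp(sharpness * (cos_angle - 1.0))

def sg_mixture(direction, lobes):
    """Sum of spherical Gaussian lobes given as (axis, sharpness, amplitude) tuples."""
    return sum(spherical_gaussian(direction, *lobe) for lobe in lobes)
```

Each lobe peaks at its axis and falls off smoothly with angle, so a small number of lobes can approximate smooth environment lighting or glossy reflectance.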


Most of the neural rendering techniques that include view-dependent effects are quite 
slow to render, since they require sampling a volumetric space along each ray, using ex- 
pensive MLPs to perform each location/direction lookup. To achieve real-time rendering 
while modeling view-dependent effects, a number of recent papers use efficient spatial data 
structures (octrees, sparse grids, or multiplane images) to store opacities and base colors (or 
potentially small MLPs) and then use factored approximations of the radiance field to model 
view-dependent effects (Wizadwongsa, Phongthawee et al. 2021; Garbin, Kowalski et al. 
2021; Reiser, Peng et al. 2021; Yu, Li et al. 2021; Hedman, Srinivasan et al. 2021). While 


the exact details of the representations used in the various stages vary amongst these papers, 
they all start with high-fidelity view-dependent models related to the original NeRF paper or 
its extensions and then “bake” or “distill” these into faster-to-evaluate spatial data structures 
and simplified (but still accurate) view-dependent models. The resulting systems produce the 
same high-fidelity renderings as full Neural Radiance Fields while often running 1000× faster 
than pure MLP-based representations. 

As you can tell from the brief discussion in this section, neural rendering is an extremely 
active research area with new architectures being proposed every few months (Dellaert and 
Yen-Chen 2021). The best place to find the latest developments, as with other topics in com- 
puter vision, is to look on arXiv and in the leading computer vision, graphics, and machine 


learning conferences. 


14.7 Additional reading 


Two good surveys of image-based rendering are by Kang, Li et al. (2006) and Shum, Chan, 
and Kang (2007), with earlier surveys available from Kang (1999), McMillan and Gortler 
(1999), and Debevec (1999). Today, the field often goes under the name of novel view syn- 
thesis (NVS), with a recent tutorial at CVPR (Gallo, Troccoli et al. 2020) providing a good 
overview of historical and current techniques. 

The term image-based rendering was introduced by McMillan and Bishop (1995), al- 
though the seminal paper in the field is the view interpolation paper by Chen and Williams 
(1993). Debevec, Taylor, and Malik (1996) describe their Façade system, which not only 
created a variety of image-based modeling tools but also introduced the widely used tech- 
nique of view-dependent texture mapping. Early work on planar impostors and layers was 
carried out by Shade, Lischinski et al. (1996), Lengyel and Snyder (1997), and Torborg and 
Kajiya (1996), while newer work based on sprites with depth is described by Shade, Gortler 
et al. (1998). Using a large number of parallel planes with RGBA colors and opacities (origi- 
nally dubbed the “stack of acetates” model by Szeliski and Golland (1999)) was rediscovered 
by Zhou, Tucker et al. (2018) and now goes by the name of multiplane images (MPI). This 
representation is widely used in recent 3D capture and rendering pipelines (Mildenhall, Srini- 
vasan et al. 2019; Choi, Gallo et al. 2019; Broxton, Flynn et al. 2020; Attal, Ling et al. 2020; 
Lin, Xu et al. 2020). To accurately model reflections, the alpha-compositing operator used in 
MPIs needs to be replaced with an additive model, as in Sinha, Kopf et al. (2012) and Kopf, 
Langguth et al. (2013). 

The two foundational papers in image-based rendering are Light field rendering by Levoy 
and Hanrahan (1996) and The Lumigraph by Gortler, Grzeszczuk et al. (1996). Buehler, 
Bosse et al. (2001) generalize the Lumigraph approach to irregularly spaced collections of 
images, while Levoy (2006) provides a survey and more gentle introduction to the topic of 
light field and image-based rendering. Wu, Masia et al. (2017) provide a more recent survey 
of this topic. More recently, neural rendering techniques have been used to improve the 
blending heuristics used in the Unstructured Lumigraph (Hedman, Philip et al. 2018; Riegler 
and Koltun 2020a). 


Surface light fields (Wood, Azuma et al. 2000; Park, Newcombe, and Seitz 2018; Yariv, 
Kasten et al. 2020) provide an alternative parameterization for light fields with accurately 
known surface geometry and support both better compression and the possibility of editing 
surface properties. Concentric mosaics (Shum and He 1999; Shum, Wang et al. 2002) and 
panoramas with depth (Peleg, Ben-Ezra, and Pritch 2001; Li, Shum et al. 2004; Zheng, Kang 
et al. 2007) provide useful parameterizations for light fields captured with panning cameras. 
Multi-perspective images (Rademacher and Bishop 1998) and manifold projections (Peleg 
and Herman 1997), although not true light fields, are also closely related to these ideas. 


Among the possible extensions of light fields to higher-dimensional structures, environ- 
ment mattes (Zongker, Werner et al. 1999; Chuang, Zongker et al. 2000) are the most useful, 
especially for placing captured objects into new scenes. 


Video-based rendering, i.e., the re-use of video to create new animations or virtual expe- 
riences, started with the seminal work of Szummer and Picard (1996), Bregler, Covell, and 
Slaney (1997), and Schödl, Szeliski et al. (2000). Important follow-on work to these basic 
re-targeting approaches includes Schödl and Essa (2002), Kwatra, Schödl et al. (2003), 
Doretto, Chiuso et al. (2003), Wang and Zhu (2003), Zhong and Sclaroff (2003), Yuan, Wen 
et al. (2004), Doretto and Soatto (2006), Zhao and Pietikäinen (2007), Chan and Vasconcelos 
(2009), Joshi, Mehta et al. (2012), Liao, Joshi, and Hoppe (2013), Liao, Finch, and Hoppe 
(2015), Yan, Liu, and Furukawa (2017), He, Liao et al. (2017), and Oh, Joo et al. (2017). 
Related techniques have also been used for performance driven video animation (Zollhöfer, 
Thies et al. 2018; Fried, Tewari et al. 2019; Chan, Ginosar et al. 2019; Egger, Smith et al. 
2020). 


Systems that allow users to change their 3D viewpoint based on multiple synchronized 
video streams include Moezzi, Katkere et al. (1996), Kanade, Rander, and Narayanan (1997), 
Matusik, Buehler et al. (2000), Matusik, Buehler, and McMillan (2001), Carranza, Theobalt 
et al. (2003), Zitnick, Kang et al. (2004), Magnor (2005), Vedula, Baker, and Kanade (2005), 
Joo, Liu et al. (2015), Anderson, Gallup et al. (2016), Tang, Dou et al. (2018), Serrano, Kim 
et al. (2019), Parra Pozo, Toksvig et al. (2019), Bansal, Vo et al. (2020), Broxton, Flynn et 
al. (2020), and Tewari, Fried et al. (2020). 3D (multi-view) video coding and compression 
is also an active area of research (Smolic and Kauff 2005; Gotchev and Rosenhahn 2009), 
and is used in 3D Blu-Ray discs and multi-view video coding (MVC) extensions to the High 
Efficiency Video Coding (HEVC) standard (Tech, Chen et al. 2015). 

The whole field of neural rendering is quite recent, with initial publications focusing on 
2D image synthesis (Zhu, Krähenbühl et al. 2016; Isola, Zhu et al. 2017) and only more 
recently being applied to 3D novel view synthesis (Hedman, Philip et al. 2018; Martin-Brualla, 
Pandey et al. 2018). Tewari, Fried et al. (2020) provide an excellent survey of this area, with 
230 references and 46 highlighted papers. Additional overviews include the related CVPR 
tutorial on neural rendering (Tewari, Zollhöfer et al. 2020), several of the lectures in the TUM 
AI Guest Lecture Series, the X-Fields paper by Bemana, Myszkowski et al. (2020, Table 1), 
and a recent bibliography by Dellaert and Yen-Chen (2021). 


14.8 Exercises 


Ex 14.1: Depth image rendering. Develop a “view extrapolation” algorithm to re-render a 
previously computed stereo depth map coupled with its corresponding reference color image. 


1. Use a 3D graphics mesh rendering system such as OpenGL with two triangles per 
pixel quad and perspective (projective) texture mapping (Debevec, Yu, and Borshukov 
1998). 


2. Alternatively, use the one- or two-pass forward warper you constructed in Exercise 3.24, 
extended using (2.68–2.70) to convert from disparities or depths into displacements. 


3. (Optional) Kinks in straight lines introduced during view interpolation or extrapola- 
tion are visually noticeable, which is one reason why image morphing systems let you 
specify line correspondences (Beier and Neely 1992). Modify your depth estimation 
algorithm to match and estimate the geometry of straight lines and incorporate it into 
your image-based rendering algorithm. 


Ex 14.2: View interpolation. Extend the system you created in the previous exercise to 
render two reference views and then blend the images using a combination of z-buffering, 
hole filling, and blending (morphing) to create the final image (Section 14.1). 


1. (Optional) If the two source images have very different exposures, the hole-filled re- 
gions and the blended regions will have different exposures. Can you extend your 


algorithm to mitigate this? 


2. (Optional) Extend your algorithm to perform three-way (trilinear) interpolation be- 
tween neighboring views. You can triangulate the reference camera poses and use 


barycentric coordinates for the virtual camera to determine the blending weights. 
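
The barycentric blending weights mentioned in the last item can be computed as in the following Python sketch. It assumes that the three reference cameras and the virtual camera have already been reduced to 2D positions (for example, their locations on a common capture plane); that parameterization and the function name are illustrative assumptions.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c).

    The three weights sum to 1 and are all non-negative when p lies inside
    the triangle, so they can be used directly as blending weights.
    """
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if abs(det) < 1e-12:
        raise ValueError("degenerate triangle")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb
```

When the virtual camera coincides with one of the reference cameras, that camera receives a weight of 1 and the others 0, so the blended result reproduces the corresponding reference view.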

Ex 14.3: View morphing. Modify your view interpolation algorithm to perform morphs 
between views of a non-rigid object, such as a person changing expressions. 

1. Instead of using a pure stereo algorithm, use a general flow algorithm to compute 
displacements, but separate them into a rigid displacement due to camera motion and a 
non-rigid deformation. 

2. At render time, use the rigid geometry to determine the new pixel location but then add 
a fraction of the non-rigid displacement as well. 

3. (Optional) Take a single image, such as the Mona Lisa or a friend’s picture, and create 
an animated 3D view morph (Seitz and Dyer 1996). 
(a) Find the vertical axis of symmetry in the image and reflect your reference image 
to provide a virtual pair (assuming the person’s hairstyle is somewhat symmetric). 
(b) Use structure from motion to determine the relative camera pose of the pair. 
(c) Use dense stereo matching to estimate the 3D shape. 
(d) Use view morphing to create a 3D animation. 


Ex 14.4: View dependent texture mapping. Use a 3D model you created along with the 
original images to implement a view-dependent texture mapping system. 

1. Use one of the 3D reconstruction techniques you developed in Exercises 11.10, 12.9, 
12.10, or 13.8 to build a triangulated 3D image-based model from multiple photographs. 

2. Extract textures for each model face from your photographs, either by performing the 
appropriate resampling or by figuring out how to use the texture mapping software to 
directly access the source images. 

3. For each new camera view, select the best source image for each visible model face. 

4. Extend this to blend between the top two or three textures. This is trickier, because 
it involves the use of texture blending or pixel shading (Debevec, Taylor, and Malik 
1996; Debevec, Yu, and Borshukov 1998; Pighin, Hecker et al. 1998). 


Ex 14.5: Layered depth images. Extend your view interpolation algorithm (Exercise 14.2) 
to store more than one depth or color value per pixel (Shade, Gortler et al. 1998), i.e., a 
layered depth image (LDI). Modify your rendering algorithm accordingly. For your data, you 
can use synthetic ray tracing, a layered reconstructed model, or a volumetric reconstruction. 

Ex 14.6: Rendering from sprites or layers. Extend your view interpolation algorithm to 
handle multiple planes or sprites (Section 14.2.1) (Shade, Gortler et al. 1998). 


1. Extract your layers using the technique you developed in Exercise 9.7. 


2. Alternatively, use an interactive painting and 3D placement system to extract your lay- 
ers (Kang 1998; Oh, Chen et al. 2001; Shum, Sun et al. 2004). 


3. Determine a back-to-front order based on expected visibility or add a z-buffer to your 


rendering algorithm to handle occlusions. 


4. Render and composite all of the resulting layers, with optional alpha matting to handle 


the edges of layers and sprites. 


5. Try one of the newer multiplane image (MPI) techniques (Zhou, Tucker et al. 2018). 
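
The compositing step in this exercise can be sketched for a single pixel as follows. This is a minimal Python illustration of back-to-front "over" compositing, assuming that each layer has already been warped into the output view and sorted by depth; the function name and the tuple layout are assumptions.

```python
def composite_layers(layers):
    """Composite RGBA samples for one pixel, given in back-to-front order.

    Each layer is (r, g, b, alpha) with non-premultiplied color in [0, 1].
    Returns the premultiplied output color and the accumulated alpha.
    """
    out_rgb = [0.0, 0.0, 0.0]
    out_alpha = 0.0
    for r, g, b, alpha in layers:
        # "Over" operator: the nearer layer attenuates what lies behind it.
        out_rgb = [alpha * src + (1.0 - alpha) * dst
                   for src, dst in zip((r, g, b), out_rgb)]
        out_alpha = alpha + (1.0 - alpha) * out_alpha
    return tuple(out_rgb), out_alpha
```

The same loop applies to multiplane images, where the layers are fronto-parallel planes at fixed depths rather than hand-extracted sprites.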


Ex 14.7: Light field transformations. Derive the equations relating regular images to 4D 
light field coordinates. 


1. Determine the mapping between the far plane (u, v) coordinates and a virtual camera’s 
(x, y) coordinates. 
(a) Start by parameterizing a 3D point on the uv plane in terms of its (u, v) coordinates. 
(b) Project the resulting 3D point to the camera pixels (x, y, 1) using the usual 3 × 4 
camera matrix P (2.63). 
(c) Derive the 2D homography relating (u, v) and (x, y) coordinates. 
2. Write down a similar transformation for (s, t) to (x, y) coordinates. 


3. Prove that if the virtual camera is actually on the (s, t) plane, the (s, t) value depends 
only on the camera’s image center and is independent of (x, y). 


4. Prove that an image taken by a regular orthographic or perspective camera, i.e., one that 
has a linear projective relationship between 3D points and (x, y) pixels (2.63), samples 
the (s, t, u, v) light field along a two-dimensional hyperplane. 


Ex 14.8: Light field and Lumigraph rendering. Implement a light field or Lumigraph ren- 


dering system: 


1. Download one of the light field datasets from http://lightfield.stanford.edu or 
https://lightfield-analysis.uni-konstanz.de. 

2. Write an algorithm to synthesize a new view from this light field, using quadri-linear 
interpolation of (s, t, u, v) ray samples. 

3. Try varying the focal plane corresponding to your desired view (Isaksen, McMillan, 
and Gortler 2000) and see if the resulting image looks sharper. 

4. Determine a 3D proxy for the objects in your scene. You can do this by running 
multi-view stereo over one of your light fields to obtain a depth map per image. 

5. Implement the Lumigraph rendering algorithm, which modifies the sampling of rays 
according to the 3D location of each surface element. 

6. Collect a set of images yourself and determine their pose using structure from motion. 

7. Implement the unstructured Lumigraph rendering algorithm from Buehler, Bosse et al. 
(2001). 
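
As a starting point for the quadri-linear interpolation step in this exercise, the following Python sketch interpolates a light field stored as nested lists indexed as `L[s][t][u][v]`. The storage layout, the use of continuous grid indices as coordinates, and the clamping at the borders are assumptions; a real implementation would typically operate on arrays of RGB values and handle each channel in the same way.

```python
import math
from itertools import product

def quadrilinear_sample(light_field, s, t, u, v):
    """Interpolate a 4D light field stored as nested lists L[s][t][u][v].

    Coordinates are continuous grid indices; they are clamped to the grid.
    """
    shape = []
    level = light_field
    for _ in range(4):
        shape.append(len(level))
        level = level[0]

    lows, fracs = [], []
    for coord, size in zip((s, t, u, v), shape):
        coord = min(max(coord, 0.0), size - 1.0)
        low = min(int(math.floor(coord)), max(size - 2, 0))
        lows.append(low)
        fracs.append(coord - low)

    value = 0.0
    # Sum over the 16 corners of the enclosing 4D cell.
    for corner in product((0, 1), repeat=4):
        weight = 1.0
        index = []
        for low, frac, bit, size in zip(lows, fracs, corner, shape):
            weight *= frac if bit else 1.0 - frac
            index.append(min(low + bit, size - 1))
        if weight > 0.0:
            i, j, k, l = index
            value += weight * light_field[i][j][k][l]
    return value
```

Each output ray is a weighted average of the 16 nearest stored rays, with weights given by products of one-dimensional linear interpolation weights along s, t, u, and v.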


Ex 14.9: Surface light fields. Construct a surface light field (Wood, Azuma et al. 2000) 
and see how well you can compress it. 


1. Acquire an interesting light field of a specular scene or object, or download one from 
http://lightfield.stanford.edu or https://lightfield-analysis.uni-konstanz.de. 

2. Build a 3D model of the object using a multi-view stereo algorithm that is robust to 
outliers due to specularities. 

3. Estimate the Lumisphere for each surface point on the object. 

4. Estimate its diffuse components. Is the median the best way to do this? Why not use 
the minimum color value? What happens if there is Lambertian shading on the diffuse 
component? 

5. Model and compress the remaining portion of the Lumisphere using one of the 
techniques suggested by Wood, Azuma et al. (2000) or invent one of your own. 

6. Study how well your compression algorithm works and what artifacts it produces. 

7. (Optional) Develop a system to edit and manipulate your surface light field. 


Ex 14.10: Handheld concentric mosaics. Develop a system to navigate a handheld con- 


centric mosaic. 


1. Stand in the middle of a room with a camcorder held at arm’s length in front of you and 
spin in a circle. 

2. Use a structure from motion system to determine the camera pose and sparse 3D 
structure for each input frame. 


3. (Optional) Re-bin your image pixels into a more regular concentric mosaic structure. 


4. At view time, determine from the new camera’s view (which should be near the plane 


of your original capture) which source pixels to display. You can simplify your com- 
putations to determine a source column (and scaling) for each output column. 


5. (Optional) Use your sparse 3D structure, interpolated to a dense depth map, to improve 


your rendering (Zheng, Kang et al. 2007). 


Ex 14.11: Video textures. Capture some videos of natural phenomena, such as a water 
fountain, fire, or smiling face, and loop the video seamlessly into an infinite length video 
(Schödl, Szeliski et al. 2000). 


1. Compare all the frames in the original clip using an L2 (sum of squared differences) 
metric. (This assumes the videos were shot on a tripod or have already been stabilized.) 

2. Filter the comparison table temporally to accentuate temporal sub-sequences that match 
well together. 

3. Convert your similarity table into a jump probability table through some exponential 
distribution. Be sure to modify transitions near the end so you do not get “stuck” in the 
last frame. 

4. Starting with the first frame, use your transition table to decide whether to jump 
forward, backward, or continue to the next frame. 

5. (Optional) Add any of the other extensions to the original video textures idea, such 
as multiple moving regions, interactive control, or graph cut spatio-temporal texture 
seaming. 
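
The first four steps of this exercise can be sketched in Python as follows. This is a minimal illustration under several assumptions: frames are flattened to lists of numbers, the temporal filter uses fixed three-tap weights along the diagonals of the distance table, a jump from frame i to frame j is scored by how closely frame j resembles frame i + 1, and the last frame is simply forbidden from repeating itself. All function names and parameters are hypothetical.

```python
import math
import random

def frame_distances(frames):
    """L2 (sum of squared differences) distance between every pair of frames."""
    n = len(frames)
    return [[sum((a - b) ** 2 for a, b in zip(frames[i], frames[j]))
             for j in range(n)] for i in range(n)]

def filter_diagonal(dist, taps=(0.25, 0.5, 0.25)):
    """Smooth the table along its diagonals to favor matching sub-sequences."""
    n, half = len(dist), len(taps) // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = norm = 0.0
            for k, w in enumerate(taps):
                ii, jj = i + k - half, j + k - half
                if 0 <= ii < n and 0 <= jj < n:
                    total += w * dist[ii][jj]
                    norm += w
            out[i][j] = total / norm
    return out

def transition_probabilities(dist, sigma):
    """P[i][j] is proportional to exp(-dist[i+1][j] / sigma); rows sum to 1."""
    n = len(dist)
    probs = []
    for i in range(n):
        nxt = min(i + 1, n - 1)
        # Crude end-of-clip fix (an assumption): never stay on the last frame.
        allowed = [j for j in range(n) if not (i == n - 1 and j == i)]
        best = min(dist[nxt][j] for j in allowed)
        row = [0.0] * n
        for j in allowed:
            # Subtracting the best distance avoids underflow for small sigma.
            row[j] = math.exp(-(dist[nxt][j] - best) / sigma)
        total = sum(row)
        probs.append([p / total for p in row])
    return probs

def next_frame(probs, current, rng=random):
    """Sample the next frame index from the current row of the table."""
    return rng.choices(range(len(probs)), weights=probs[current])[0]
```

Smaller values of `sigma` concentrate the probability on the best transitions, which yields smoother loops with less variety, while larger values allow more varied but more visible jumps.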


Ex 14.12: Neural rendering. Most of the recent neural rendering papers come with open 


source code as well as carefully acquired datasets. 


Try downloading more than one of these and run different algorithms on different datasets. 


Compare the quality of the renderings you obtain and list the visual artifacts you detect and 


how you might improve them. 


Try capturing your own dataset, if this is feasible, and describe additional breaking points 


of the current algorithms. 


Chapter 15 


Conclusion 


In this book, we have covered a broad range of computer vision topics. We started with a 
review of basic geometry and optics, as well as mathematical tools such as image and signal 
processing, continuous and discrete optimization, statistical modeling, and machine learning. 
We then used these to develop computer vision algorithms such as image enhancement and 
segmentation, object detection and classification, motion estimation, and 3D shape recon- 
struction. These components, in turn, enabled us to build more complex applications, such as 
large-scale image retrieval, converting images to descriptions, stitching multiple images into 
wider and higher dynamic range composites, tracking people and objects, navigating in new 
unseen environments, and augmenting video with embedded 3D overlays. 

In the decade since the publication of the first edition of this book, the computer vision 
field has exploded, both in the maturity and reliability of vision algorithms, as well as the 
number of practitioners and commercial applications. The most notable advance has been 
in deep learning, which now enables visual recognition at a level of performance that has 
eclipsed what we could do ten years ago. Deep learning has also found widespread applica- 
tion in basic vision algorithms such as image enhancement, motion estimation, and 3D shape 
recovery. Other advances, such as reliable real-time tracking and reconstruction have en- 
abled applications such as autonomous navigation and phone-based augmented reality. And 
advances in sophisticated image processing have produced computational photography algo- 
rithms that run in every mobile phone producing images that surpass the quality available 
with much more expensive traditional photographic equipment. 

You may ask: Why is our field so broad and aren't there any unifying principles that can 
be used to simplify our study? Part of the answer lies in the expansive definition of computer 


vision, which is the capture, analysis, and interpretation of our 3D environment using images 
and video, as well as the incredible complexity inherent in the formation of visual imagery. In 
some ways, our field is as complex as the study of automotive engineering, which requires an 
understanding of internal combustion, mechanics, aerodynamics, ergonomics, electrical cir- 
cuitry, and control systems, among other topics. Computer vision similarly draws on a wide 
variety of sub-disciplines, which makes it challenging to cover in a one-semester course, 
or even to achieve mastery during a course of graduate studies. Conversely, the incredible 
breadth and technical complexity of computer vision is what draws many people to this re- 
search field. 


Because of this richness and the difficulty in making and measuring progress, I attempt 
to instill in my students, and hopefully in the readers of this book, a discipline founded on 


principles from engineering, science, statistics, and machine learning. 


The engineering approach to problem solving is to first carefully define the overall prob- 
lem being tackled and to question the basic assumptions and goals inherent in this process. 
Once this has been done, a number of alternative solutions or approaches are implemented 
and carefully tested, paying attention to issues such as reliability and computational cost. 
Finally, one or more solutions are deployed and evaluated in real-world settings. For this 
reason, this book contains different alternatives for solving vision problems, many of which 


are sketched out in the exercises for students to implement and test on their own. 


The scientific approach builds upon a basic understanding of physical principles. In the 
case of computer vision, this includes the physics of natural and artificial structures, image 
formation, including lighting and atmospheric effects, optics, and noisy sensors. The task is to 
then invert this formation using stable and efficient algorithms to obtain reliable descriptions 
of the scene and other quantities of interest. The scientific approach also encourages us to 
formulate and test hypotheses, which is similar to the extensive testing and evaluation inherent 
in engineering disciplines. 

Because so much about the image formation process is inherently uncertain and ambiguous, 
a statistical approach, which models both the uncertainty and prior distributions in the 
world, as well as the degradations in the image formation process, is often essential. Bayesian 
inference techniques can then be used to combine prior and measurement models to estimate 
the unknowns and to model their uncertainty. Efficient learning and inference algorithms, 
such as dynamic programming, graph cuts, and belief propagation, often play a crucial role 


in this process. 


Finally, machine learning techniques, driven by large amounts of training data—both 
labeled (supervised) and unlabeled (unsupervised)—enable the development of models that 
can discover hard to describe regularities and patterns in the world, which can make inference 
more reliable. However, despite the incredible advances enabled by learning techniques, we 
must still remain cautious about the inherent limitations of learning-based approaches, and 
not just slough off problems due to “insufficient or biased training data” as someone else’s 
problem. 

Along these lines, I was inspired by a segment from Shree Nayar’s First Principles of 
Computer Vision online lecture series:1 


Since deep learning is very popular today, you may be wondering if it is worth 
knowing the first principles of vision, or for that matter, the first principles of any 
field. Given a task, why not just train a neural network with tons of data to solve 
the task? 


Indeed, there are applications where such an approach may suffice. But there are
several reasons to embrace the basics. First, it would be laborious and unnecessary
to train a network to learn a phenomenon that can be precisely and concisely
described using first principles. Second, when a network does not perform well
enough, first principles are your only hope for understanding why. Third, collecting
data to train a network can be tedious, and sometimes even impractical.
In such cases, models based on first principles can be used to synthesize the data,
instead of collecting it. And finally, the most compelling reason to learn the first
principles of any field is curiosity. What makes humans unique is that innate
desire to know why things work the way they do.


Given the breadth of material we have covered in this book, what new developments are
we likely to see in the future? It seems fairly obvious from the tremendous advances in the last
decade that machine learning, including the ability to fine-tune architectures and algorithms
to optimize continuous criteria and metrics, will continue to evolve and produce significant
improvements. The current dominance of feedforward convolutional architectures, mostly
using weighted linear summation and simple non-linearities, is likely to evolve to include
more complex architectures with attention and top-down feedback, as we are already starting
to see. Sophisticated application-specific imaging sensors will likely start being used more
often, displacing and enhancing the use of visible light imaging sensors originally developed
for photography. Integration with additional sensors, such as IMUs and potentially active
sensing (where power permits), will make classic problems such as real-time localization and
3D reconstruction much more reliable and ubiquitous.

The most challenging applications of computer vision will likely remain in the realm
of artificial general intelligence (AGI), which aims to create systems that exhibit the same
range of understanding and behaviors as people. Since progress here depends on concurrent
progress in many other aspects of artificial intelligence, it will be interesting to see how these
different AI modalities and capabilities leverage each other for improved performance.

¹ https://fpcv.cs.columbia.edu, Introduction: Overview video

Whatever the outcome of these research endeavors, computer vision is already having
a tremendous impact in many areas, including digital photography, visual effects, medical
imaging, safety and surveillance, image search, product recommendations, and aids for the
visually impaired. The breadth of the problems and techniques inherent in this field, combined
with the richness of the mathematics and the utility of the resulting algorithms, will
ensure that this remains an exciting area of study for years to come.

In this appendix, we introduce some elements of linear algebra and numerical techniques
that are used elsewhere in the book. We start with some basic decompositions in matrix
algebra, including the singular value decomposition (SVD), eigenvalue decompositions, and
other matrix decompositions (factorizations). Next, we look at the problem of linear least
squares, which can be solved using either the QR decomposition or normal equations. This
is followed by non-linear least squares, which arise when the measurement equations are not
linear in the unknowns or when robust error functions are used. Such problems require
iteration to find a solution. Next, we look at direct solution (factorization) techniques for sparse
problems, where the ordering of the variables may have a large influence on the computation
and memory requirements. Finally, we discuss iterative techniques for solving large linear
(or linearized) least squares problems. Good general references for much of this material
include books by Björck (1996), Golub and Van Loan (1996), Trefethen and Bau (1997), Meyer
(2000), Nocedal and Wright (2006), Björck and Dahlquist (2010), and Deisenroth, Faisal, and
Ong (2020), and the collection of matrix formulas compiled by Petersen and Pedersen (2012).


A note on vector and matrix indexing. To be consistent with the rest of the book and 
with the general usage in the computer science and computer vision communities, I adopt 
a 0-based indexing scheme for vector and matrix element indexing. Please note that most 
mathematical textbooks and papers use 1-based indexing, so you need to be aware of the 
differences when you read this book. 


A.1 Matrix decompositions 


To better understand the structure of matrices and more stably perform operations such as
inversion and system solving, a number of decompositions (or factorizations) can be used. In
this section, we review singular value decomposition (SVD), eigenvalue decomposition, QR
factorization, and Cholesky factorization.
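
A.1.1 Singular value decomposition


One of the most useful decompositions in matrix algebra is the singular value decomposition
(SVD), which states that any real-valued m × n matrix A can be written as

$$\mathbf{A}_{m \times n} = \mathbf{U}_{m \times p}\, \boldsymbol{\Sigma}_{p \times p}\, \mathbf{V}^T_{p \times n} \tag{A.1}$$

$$= \begin{bmatrix} \mathbf{u}_0 & \cdots & \mathbf{u}_{p-1} \end{bmatrix} \begin{bmatrix} \sigma_0 & & \\ & \ddots & \\ & & \sigma_{p-1} \end{bmatrix} \begin{bmatrix} \mathbf{v}_0^T \\ \vdots \\ \mathbf{v}_{p-1}^T \end{bmatrix},$$

where p = min(m, n). The matrices U and V are orthonormal, i.e., $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ and
$\mathbf{V}^T\mathbf{V} = \mathbf{I}$, and so are their column vectors,

$$\mathbf{u}_i \cdot \mathbf{u}_j = \mathbf{v}_i \cdot \mathbf{v}_j = \delta_{ij}. \tag{A.2}$$

The singular values are all non-negative and can be ordered in decreasing order

$$\sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{p-1} \geq 0. \tag{A.3}$$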


A geometric intuition for the SVD of a matrix A can be obtained by re-writing
$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ in (A.1) as

$$\mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma} \quad \text{or} \quad \mathbf{A}\mathbf{v}_j = \sigma_j \mathbf{u}_j. \tag{A.4}$$

This formula says that the matrix A takes any basis vector $\mathbf{v}_j$ and maps it to a direction $\mathbf{u}_j$
with length $\sigma_j$, as shown in Figure A.1.

If only the first r singular values are positive, the matrix A is of rank r and the index p
in the SVD (A.1) can be replaced by r. (In other words, we can drop the last
p − r columns of U and V.)

An important property of the singular value decomposition of a matrix (also true for
the eigenvalue decomposition of a real symmetric non-negative definite matrix) is that if we
truncate the expansion

$$\mathbf{A} = \sum_{j=0}^{t} \sigma_j \mathbf{u}_j \mathbf{v}_j^T, \tag{A.5}$$

we obtain the best possible least squares approximation to the original matrix A. This is
used both in eigenface-based face recognition systems (Section 5.2.3) and in the separable
approximation of convolution kernels (3.21).
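
The truncated expansion (A.5) can be checked numerically. The following is a minimal NumPy sketch, not code from the book: the helper name `truncated_svd`, the small binomial kernel, and the random test matrix are all illustrative assumptions. It shows that a separable kernel is recovered exactly with t = 0, and that for a general matrix the squared Frobenius error of the truncation equals the sum of the discarded squared singular values.

```python
import numpy as np

def truncated_svd(A, t):
    """Return the truncation sum_{j=0..t} sigma_j u_j v_j^T of (A.5) (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted in decreasing order
    return (U[:, :t + 1] * s[:t + 1]) @ Vt[:t + 1, :]

# A 3x3 binomial blur kernel is an outer product (rank 1), so t = 0 recovers it.
k = np.array([1.0, 2.0, 1.0]) / 4.0
K = np.outer(k, k)
K0 = truncated_svd(K, 0)
print(np.allclose(K0, K))

# For a general matrix, the squared Frobenius error of the truncation equals
# the sum of the discarded squared singular values.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - truncated_svd(A, 1), "fro") ** 2
print(np.isclose(err, np.sum(s[2:] ** 2)))
```

With t = 0, the single pair of singular vectors (scaled by the square root of the leading singular value) gives the horizontal and vertical 1D kernels of the separable approximation.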

Figure A.1 The action of a matrix A can be visualized by thinking of the domain as being
spanned by a set of orthonormal vectors $\mathbf{v}_j$, each of which is transformed to a new orthogonal
vector $\mathbf{u}_j$ with a length $\sigma_j$. When A is interpreted as a covariance matrix and its eigenvalue
decomposition is performed, each of the $\mathbf{u}_j$ axes denotes a principal direction (component)
and each $\sigma_j$ denotes one standard deviation along that direction.


A.1.2 Eigenvalue decomposition 


If the matrix C is symmetric (m = n),¹ it can be written as an eigenvalue decomposition,

$$\mathbf{C} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T = \begin{bmatrix} \mathbf{u}_0 & \cdots & \mathbf{u}_{n-1} \end{bmatrix} \begin{bmatrix} \lambda_0 & & \\ & \ddots & \\ & & \lambda_{n-1} \end{bmatrix} \begin{bmatrix} \mathbf{u}_0^T \\ \vdots \\ \mathbf{u}_{n-1}^T \end{bmatrix} = \sum_{i=0}^{n-1} \lambda_i \mathbf{u}_i \mathbf{u}_i^T. \tag{A.6}$$

(The eigenvector matrix U is sometimes written as Φ and the eigenvectors u as φ.) In this
case, the eigenvalues

$$\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{n-1} \tag{A.7}$$

can be both positive and negative.²

¹ In this appendix, we denote symmetric matrices using C and general rectangular matrices using A.
² Eigenvalue decompositions can be computed for non-symmetric matrices, but the eigenvalues and eigenvectors
can have complex entries in that case.

A special case of the symmetric matrix C occurs when it is constructed as the sum of a
number of outer products

$$\mathbf{C} = \sum_i \mathbf{a}_i \mathbf{a}_i^T = \mathbf{A}\mathbf{A}^T, \tag{A.8}$$

which often occurs when solving least squares problems (Appendix A.2), where the matrix A
consists of all the $\mathbf{a}_i$ column vectors stacked side-by-side. In this case, we are guaranteed that
all of the eigenvalues $\lambda_i$ are non-negative. The associated matrix C is positive semi-definite

$$\mathbf{x}^T \mathbf{C} \mathbf{x} \geq 0, \quad \forall \mathbf{x}. \tag{A.9}$$

If the matrix C is of full rank, the eigenvalues are all positive and the matrix is called symmetric
positive definite (SPD).

Symmetric positive semi-definite matrices also arise in the statistical analysis of data, as
they represent the covariance of a set of $\{\mathbf{x}_i\}$ points around their mean $\bar{\mathbf{x}}$,

$$\mathbf{C} = \frac{1}{n} \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T. \tag{A.10}$$

In this case, performing the eigenvalue decomposition is known as principal component analysis
(PCA), because it models the principal directions (and magnitudes) of variation of the
point distribution around their mean, as shown in Section 7.3.1, Section 5.2.3 (5.41), and
Appendix B.1 (B.10). Figure A.1 shows how the principal components of the covariance
matrix C denote the principal axes $\mathbf{u}_j$ of the uncertainty ellipsoid corresponding to this point
distribution and how the $\sigma_j = \sqrt{\lambda_j}$ denote the standard deviations along each axis.
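The eigenvalues and eigenvectors of C and the singular values and singular vectors of A
are closely related. Given

$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T, \tag{A.11}$$

we get

$$\mathbf{C} = \mathbf{A}\mathbf{A}^T = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\mathbf{V}\boldsymbol{\Sigma}\mathbf{U}^T = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T. \tag{A.12}$$

From this, we see that $\lambda_i = \sigma_i^2$ and that the left singular vectors of A are the eigenvectors of
C.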

This relationship gives us an efficient method for computing the eigenvalue decomposition
of large matrices that are rank deficient, such as the scatter matrices observed in computing
eigenfaces (Section 5.2.3). Observe that the covariance matrix C in (5.41) is exactly the
same as C in (A.8). Note also that the individual difference-from-mean images $\mathbf{a}_i = \mathbf{x}_i - \bar{\mathbf{x}}$
are long vectors of length P (the number of pixels in the image), while the total number of
exemplars N (the number of faces in the training database) is much smaller. Instead of forming
$\mathbf{C} = \mathbf{A}\mathbf{A}^T$, which is P × P, we form the matrix

$$\hat{\mathbf{C}} = \mathbf{A}^T\mathbf{A}, \tag{A.13}$$

which is N × N. (This involves taking the dot product between every pair of difference
images $\mathbf{a}_i$ and $\mathbf{a}_j$.) The eigenvalues of $\hat{\mathbf{C}}$ are the squared singular values of A, namely $\boldsymbol{\Sigma}^2$,
and are hence also the non-zero eigenvalues of C. The eigenvectors of $\hat{\mathbf{C}}$ are the right singular vectors
V of A, from which the desired eigenfaces U, which are the left singular vectors of A, can
be computed as

$$\mathbf{U} = \mathbf{A}\mathbf{V}\boldsymbol{\Sigma}^{-1}. \tag{A.14}$$

This final step is essentially computing the eigenfaces as linear combinations of the difference
images (Turk and Pentland 1991). If you have access to a high-quality linear algebra package
such as LAPACK, routines for efficiently computing a small number of the left singular
vectors and singular values of rectangular matrices such as A are usually provided
(Appendix C.2). However, if storing all of the images in memory is prohibitive, the construction
of $\hat{\mathbf{C}}$ in (A.13) can be used instead.
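
The construction in (A.13) and (A.14) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the image size P, the number of exemplars N, and the random stand-in "images" are made-up assumptions, not data from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 500, 8                            # pixels per image, number of exemplars (made-up sizes)
X = rng.standard_normal((P, N))          # stand-in for face images, one per column
A = X - X.mean(axis=1, keepdims=True)    # difference-from-mean images a_i as columns

C_hat = A.T @ A                          # the small N x N matrix of (A.13)
lam, V = np.linalg.eigh(C_hat)           # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]            # re-sort into decreasing order
lam, V = lam[order], V[:, order]

r = N - 1                                # subtracting the mean leaves at most N - 1 non-zero eigenvalues
sigma = np.sqrt(lam[:r])                 # singular values of A
U = (A @ V[:, :r]) / sigma               # (A.14): U = A V Sigma^{-1}, one eigenface per column
```

The columns of `U` are orthonormal eigenvectors of the P × P matrix C = A Aᵀ, which is never formed explicitly; only N × N and P × N arrays are needed.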

How can eigenvalue and singular value decompositions actually be computed? Notice
that an eigenvector is defined by the equation

$$\lambda_i \mathbf{u}_i = \mathbf{C}\mathbf{u}_i \quad \text{or} \quad (\lambda_i \mathbf{I} - \mathbf{C})\mathbf{u}_i = 0. \tag{A.15}$$

(This can be derived from (A.6) by post-multiplying both sides by $\mathbf{u}_i$.) Because the latter
equation is homogeneous, i.e., it has a zero right-hand-side, it can only have a non-zero
(non-trivial) solution for $\mathbf{u}_i$ if the system is rank deficient, i.e.,

$$|(\lambda \mathbf{I} - \mathbf{C})| = 0. \tag{A.16}$$

Evaluating this determinant yields a characteristic polynomial equation in λ, which can be
solved for small problems, e.g., 2 × 2 or 3 × 3 matrices, in closed form.
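
As a small worked case of (A.16), consider a symmetric 2 × 2 matrix with rows (a, b) and (b, d). Its characteristic polynomial is λ² − (a + d)λ + (ad − b²) = 0, whose roots can be written in a numerically convenient "mean plus or minus radius" form. The function name below is a hypothetical helper chosen for illustration.

```python
import math

def eig2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], largest first, from |lambda I - C| = 0."""
    mean = 0.5 * (a + d)                      # half the trace
    radius = math.hypot(0.5 * (a - d), b)     # sqrt(((a - d) / 2)^2 + b^2)
    return mean + radius, mean - radius

lam0, lam1 = eig2x2_symmetric(2.0, 1.0, 2.0)
print(lam0, lam1)   # 3.0 1.0
```

The two roots sum to the trace a + d and multiply to the determinant ad − b², which gives a quick sanity check for any input.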

For larger matrices, iterative algorithms that first reduce the matrix C to a real symmetric
tridiagonal form using orthogonal transforms and then perform QR iterations are normally
used (Golub and Van Loan 1996; Trefethen and Bau 1997; Björck and Dahlquist 2010). As
these techniques are rather involved, it is best to use a linear algebra package such as LAPACK
(Anderson, Bai et al. 1999); see Appendix C.2.

Factorization with missing data requires different kinds of iterative algorithms, which often
involve either hallucinating the missing terms or minimizing some weighted reconstruction
metric, which is intrinsically much more challenging than regular factorization. This
area has been widely studied in computer vision (Shum, Ikeuchi, and Reddy 1995; De la
Torre and Black 2003; Huynh, Hartley, and Heyden 2003; Buchanan and Fitzgibbon 2005;
Gross, Matthews, and Baker 2006; Torresani, Hertzmann, and Bregler 2008) and is sometimes
called generalized PCA. However, this term is also sometimes used to denote algebraic
subspace clustering techniques, which are the subject of the monograph by Vidal, Ma, and
Sastry (2016).

A.1.3 QR factorization


A widely used technique for stably solving poorly conditioned least squares problems (Björck
1996), and the basis of more complex algorithms, such as computing the SVD and eigenvalue
decompositions, is the QR factorization,

$$\mathbf{A} = \mathbf{Q}\mathbf{R}, \tag{A.17}$$

where Q is an orthonormal (or unitary) matrix, $\mathbf{Q}\mathbf{Q}^T = \mathbf{I}$, and R is upper triangular.³ In
computer vision, QR can be used to convert a camera matrix into a rotation matrix and an
upper-triangular calibration matrix (11.13) and also in various self-calibration algorithms
(Section 11.3.4). The most common algorithms for computing QR decompositions (modified
Gram-Schmidt, Householder transformations, and Givens rotations) are described by
Golub and Van Loan (1996), Trefethen and Bau (1997), and Björck and Dahlquist (2010) and
are also found in LAPACK. Unlike the SVD and eigenvalue decompositions, QR factorization
does not require iteration and can be computed exactly in $O(MN^2 + N^3)$ operations,
where M is the number of rows and N is the number of columns (for a tall matrix).


A.1.4 Cholesky factorization 


Cholesky factorization can be applied to any symmetric positive definite matrix C to convert
it into a product of a lower-triangular matrix and its upper-triangular transpose,

$$\mathbf{C} = \mathbf{L}\mathbf{L}^T = \mathbf{R}^T\mathbf{R}, \tag{A.18}$$

where L is a lower-triangular matrix and R is an upper-triangular matrix. Unlike Gaussian
elimination, which may require pivoting (row and column reordering) or may become
unstable (sensitive to roundoff errors or reordering), Cholesky factorization remains stable for
positive definite matrices, such as those that arise from normal equations in least squares
problems (Appendix A.2). Because of the form of (A.18), the matrices L and R are sometimes
called matrix square roots.⁴

³ The term “R” comes from the German name for the lower-upper (LU) decomposition, which is LR for “links”
and “rechts” (left and right of the diagonal).
⁴ In fact, there exists a whole family of matrix square roots. Any matrix of the form LQ or QR, where Q is a
unitary matrix, is a square root of C.

The algorithm to compute an upper triangular Cholesky decomposition of C is a straightforward
symmetric generalization of Gaussian elimination and is based on the decomposition
(Björck 1996; Golub and Van Loan 1996)

$$\mathbf{C} = \begin{bmatrix} \gamma & \mathbf{c}^T \\ \mathbf{c} & \mathbf{C}_{11} \end{bmatrix} \tag{A.19}$$

$$= \begin{bmatrix} \gamma^{1/2} & \mathbf{0}^T \\ \mathbf{c}\gamma^{-1/2} & \mathbf{I} \end{bmatrix} \begin{bmatrix} 1 & \mathbf{0}^T \\ \mathbf{0} & \mathbf{C}_{11} - \mathbf{c}\gamma^{-1}\mathbf{c}^T \end{bmatrix} \begin{bmatrix} \gamma^{1/2} & \gamma^{-1/2}\mathbf{c}^T \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \tag{A.20}$$

$$= \mathbf{R}_0^T \mathbf{C}_1 \mathbf{R}_0, \tag{A.21}$$

which, through recursion, can be turned into

$$\mathbf{C} = \mathbf{R}_0^T \cdots \mathbf{R}_{n-1}^T \mathbf{R}_{n-1} \cdots \mathbf{R}_0 = \mathbf{R}^T\mathbf{R}. \tag{A.22}$$

procedure Cholesky(C, R):
    R = C
    for i = 0 ... n − 1
        for j = i + 1 ... n − 1
            R_{j,j:n−1} = R_{j,j:n−1} − r_{ij} r_{ii}^{−1} R_{i,j:n−1}
        R_{i,i:n−1} = r_{ii}^{−1/2} R_{i,i:n−1}

Algorithm A.1 Cholesky decomposition of the matrix C into its upper triangular form R.

Algorithm A.1 provides a more procedural definition, which can store the upper-triangular
matrix R in the same space as C, if desired. The total operation count for Cholesky factorization
is $O(N^3)$ for a dense matrix but can be significantly lower for sparse matrices with
low fill-in (Appendix A.4).

Note that Cholesky decomposition can also be applied to block-structured matrices, where
the term γ in (A.19) is now a square block sub-matrix and c is a rectangular matrix (Golub
and Van Loan 1996). The computation of square roots can be avoided by leaving the γ on the
diagonal of the middle factor in (A.20), which results in the $\mathbf{C} = \mathbf{L}\mathbf{D}\mathbf{L}^T$ factorization, where
D is a diagonal matrix. However, as square roots are relatively fast on modern computers,
this is not worth the bother and Cholesky factorization is usually preferred.
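
Algorithm A.1 translates almost line for line into NumPy. The sketch below is an illustration rather than production code: the function name `cholesky_upper` is an assumption, the positive-definiteness check is an added safeguard not present in the algorithm listing, and in practice a library routine such as `numpy.linalg.cholesky` would normally be used instead.

```python
import numpy as np

def cholesky_upper(C):
    """Return upper-triangular R with C = R^T R, following the steps of Algorithm A.1."""
    R = np.array(C, dtype=float)          # work on a copy; only the upper triangle is used
    n = R.shape[0]
    for i in range(n):
        if R[i, i] <= 0.0:                # added safeguard: the pivot must be positive
            raise ValueError("matrix is not positive definite")
        for j in range(i + 1, n):
            # R_{j,j:n-1} = R_{j,j:n-1} - r_ij r_ii^{-1} R_{i,j:n-1}
            R[j, j:] -= (R[i, j] / R[i, i]) * R[i, j:]
        R[i, i:] /= np.sqrt(R[i, i])      # scale row i by r_ii^{-1/2}
    return np.triu(R)                     # discard the untouched lower triangle

C = np.array([[4.0, 2.0],
              [2.0, 3.0]])
R = cholesky_upper(C)
print(np.allclose(R.T @ R, C))
```

NumPy's own routine returns the lower-triangular factor L, so its transpose is the matrix to compare against.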

A.2 Linear least squares


Least squares fitting problems are pervasive in computer vision. For example, the alignment
of images based on matching feature points involves the minimization of a squared distance
objective function (8.2),

$$E_{\text{LS}} = \sum_i \|\mathbf{r}_i\|^2 = \sum_i \|\mathbf{f}(\mathbf{x}_i; \mathbf{p}) - \mathbf{x}'_i\|^2, \tag{A.23}$$

where

$$\mathbf{r}_i = \mathbf{x}'_i - \mathbf{f}(\mathbf{x}_i; \mathbf{p}) = \hat{\mathbf{x}}'_i - \tilde{\mathbf{x}}'_i \tag{A.24}$$

is the residual between the measured location $\hat{\mathbf{x}}'_i$ and its corresponding current predicted
location $\tilde{\mathbf{x}}'_i = \mathbf{f}(\mathbf{x}_i; \mathbf{p})$. More complex versions of least squares problems, such as large-scale
structure from motion (Section 11.4.2), may involve the minimization of functions of thousands
of variables. Even problems such as image filtering (Section 3.4.1) and regularization
(Section 4.2) may involve the minimization of sums of squared errors.
Figure A.2a shows an example of a simple least squares line fitting problem, where the
quantities being estimated are the line equation parameters (m, b). When the sampled vertical
values $y_i$ are assumed to be noisy versions of points on the line y = mx + b, the optimal
estimates for (m, b) can be found by minimizing the squared vertical residuals

$$E_{\text{VLS}} = \sum_i |y_i - (m x_i + b)|^2. \tag{A.25}$$

Note that the function being fitted need not itself be linear to use linear least squares. All that
is required is that the function be linear in the unknown parameters. For example, polynomial
fitting can be written as

$$E_{\text{PLS}} = \sum_i \Big| y_i - \Big( \sum_{j=0}^{p} a_j x_i^j \Big) \Big|^2, \tag{A.26}$$

while sinusoid fitting with unknown amplitude A and phase φ (but known frequency f) can
be written as

$$E_{\text{SLS}} = \sum_i |y_i - A \sin(2\pi f x_i + \phi)|^2 = \sum_i |y_i - (B \sin 2\pi f x_i + C \cos 2\pi f x_i)|^2, \tag{A.27}$$

which is linear in (B, C).
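
The reparameterization in (A.27) follows from the angle-sum identity, which gives B = A cos φ and C = A sin φ, so the amplitude and phase can be recovered from the linear solution. The sketch below illustrates this with NumPy's `lstsq`; the frequency, amplitude, phase, and sample locations are made-up values, and the data are noise-free so that the recovery is exact up to roundoff.

```python
import numpy as np

f = 0.5                                   # known frequency (assumed value)
amp_true, phase_true = 2.0, 0.3           # made-up ground truth
x = np.linspace(0.0, 4.0, 50)
y = amp_true * np.sin(2 * np.pi * f * x + phase_true)

# Each row of the design matrix is [sin(2 pi f x_i), cos(2 pi f x_i)].
D = np.column_stack([np.sin(2 * np.pi * f * x), np.cos(2 * np.pi * f * x)])
(B, C), *_ = np.linalg.lstsq(D, y, rcond=None)

amp = np.hypot(B, C)                      # A = sqrt(B^2 + C^2)
phase = np.arctan2(C, B)                  # phi = atan2(C, B)
print(amp, phase)
```

Note that the scalar C here is the cosine coefficient of (A.27), not the symmetric matrix C used elsewhere in this appendix.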


In general, it is more common to denote the unknown parameters using x and to write the
general form of linear least squares as⁵

$$E_{\text{LLS}} = \sum_i |\mathbf{a}_i \mathbf{x} - b_i|^2 = \|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2. \tag{A.28}$$

⁵ Be extra careful in interpreting the variable names here. In the 2D line-fitting example, x is used to denote the
horizontal axis, but in the general least squares problem, x = (m, b) denotes the unknown parameter vector.

Figure A.2 Least squares regression. (a) The line y = mx + b is fitted to the four noisy
data points, $\{(x_i, y_i)\}$, denoted by ×, by minimizing the squared vertical residuals between
the data points and the line, $\sum_i \|y_i - (m x_i + b)\|^2$. (b) When the measurements $\{(x_i, y_i)\}$
are assumed to have noise in all directions, the sum of orthogonal squared distances to the
line $\sum_i \|a x_i + b y_i + c\|^2$ is minimized using total least squares.

Expanding (A.28) gives us

$$E_{\text{LLS}} = \mathbf{x}^T(\mathbf{A}^T\mathbf{A})\mathbf{x} - 2\mathbf{x}^T(\mathbf{A}^T\mathbf{b}) + \|\mathbf{b}\|^2, \tag{A.29}$$

whose minimum value for x can be found by solving the associated normal equations (Björck
1996; Golub and Van Loan 1996)

$$(\mathbf{A}^T\mathbf{A})\mathbf{x} = \mathbf{A}^T\mathbf{b}. \tag{A.30}$$

The preferred way to solve the normal equations is to use Cholesky factorization. Let

$$\mathbf{C} = \mathbf{A}^T\mathbf{A} = \mathbf{R}^T\mathbf{R}, \tag{A.31}$$

where R is the upper-triangular Cholesky factor of the Hessian C, and

$$\mathbf{d} = \mathbf{A}^T\mathbf{b}. \tag{A.32}$$

After factorization, the solution for x can be obtained as

$$\mathbf{R}^T\mathbf{z} = \mathbf{d}, \quad \mathbf{R}\mathbf{x} = \mathbf{z}, \tag{A.33}$$

which involves the solution of two triangular systems, i.e., forward and backward substitution
(Björck 1996).
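
The sequence (A.31) to (A.33) can be sketched for the line-fitting problem of (A.25). The data values and the helper names `solve_lower` and `solve_upper` are illustrative assumptions; the two helpers spell out forward and backward substitution explicitly, although a library triangular solver would normally be used.

```python
import numpy as np

def solve_lower(L, d):
    """Forward substitution for L z = d with L lower triangular."""
    n = len(d)
    z = np.zeros(n)
    for i in range(n):
        z[i] = (d[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

def solve_upper(R, z):
    """Backward substitution for R x = z with R upper triangular."""
    n = len(z)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

# Made-up data lying near the line y = 2x + 1.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 2.9, 5.2, 6.8])
A = np.column_stack([xs, np.ones_like(xs)])   # rows a_i = (x_i, 1); unknowns are (m, b)

C = A.T @ A                                   # (A.31)
d = A.T @ ys                                  # (A.32)
R = np.linalg.cholesky(C).T                   # NumPy returns the lower factor L; R = L^T
z = solve_lower(R.T, d)                       # R^T z = d
m, b = solve_upper(R, z)                      # R x = z
print(m, b)
```

For these four points the fitted slope and intercept come out close to 1.94 and 1.09, matching what a general-purpose least squares solver returns.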

In cases where the least squares problem is numerically poorly conditioned (which should
generally be avoided by adding sufficient regularization or prior knowledge about the
parameters (Appendix A.3)), it is possible to use QR factorization or SVD directly on the matrix
A (Björck 1996; Golub and Van Loan 1996; Trefethen and Bau 1997; Nocedal and Wright
2006; Björck and Dahlquist 2010), e.g.,

$$\mathbf{A}\mathbf{x} = \mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{b} \quad \longrightarrow \quad \mathbf{R}\mathbf{x} = \mathbf{Q}^T\mathbf{b}. \tag{A.34}$$

Note that the upper triangular matrices R produced by the Cholesky factorization of
$\mathbf{C} = \mathbf{A}^T\mathbf{A}$ and the QR factorization of A are the same (up to the signs of their rows), but that
solving (A.34) is generally more stable (less sensitive to roundoff error) but slower (by a constant factor).
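
The relationship between the two R factors, and the QR solution path of (A.34), can be checked with a short NumPy sketch. The random test problem is an assumption made for illustration; `np.linalg.solve` is used for the final step only for brevity, where a dedicated triangular solver would exploit the structure of R.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 3))           # made-up tall, well-conditioned problem
b = rng.standard_normal(20)

Q, R_qr = np.linalg.qr(A)                  # reduced QR: Q is 20 x 3, R_qr is 3 x 3
x_qr = np.linalg.solve(R_qr, Q.T @ b)      # (A.34): R x = Q^T b

R_chol = np.linalg.cholesky(A.T @ A).T     # upper-triangular Cholesky factor of A^T A

# QR leaves the sign of each row of R free, so flip rows to make the diagonal positive
# before comparing with the Cholesky factor (whose diagonal is always positive).
signs = np.sign(np.diag(R_qr))
R_qr_pos = signs[:, None] * R_qr
print(np.allclose(R_qr_pos, R_chol))
```

Forming AᵀA squares the condition number of the problem, which is one way to see why the QR route is less sensitive to roundoff.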


A.2.1 Total least squares 


In some problems, e.g., when performing geometric line fitting in 2D images or 3D plane
fitting to point cloud data, instead of having measurement error along one particular axis, the
measured points have uncertainty in all directions, which is known as the errors-in-variables
model (Van Huffel and Lemmerling 2002; Matei and Meer 2006). In this case, it makes more
sense to minimize a set of homogeneous squared errors of the form

$$E_{\text{TLS}} = \sum_i (\mathbf{a}_i \mathbf{x})^2 = \|\mathbf{A}\mathbf{x}\|^2, \tag{A.35}$$

which is known as total least squares (TLS) (Van Huffel and Vandewalle 1991; Björck 1996;
Golub and Van Loan 1996; Van Huffel and Lemmerling 2002).

The above error metric has a trivial minimum solution at x = 0 and is, in fact, homogeneous
in x. For this reason, we augment this minimization problem with the requirement that
$\|\mathbf{x}\|^2 = 1$, which results in the eigenvalue problem

$$\mathbf{x} = \arg\min_{\mathbf{x}} \mathbf{x}^T(\mathbf{A}^T\mathbf{A})\mathbf{x} \quad \text{such that} \quad \|\mathbf{x}\|^2 = 1. \tag{A.36}$$

The value of x that minimizes this constrained problem is the eigenvector associated with the
smallest eigenvalue of $\mathbf{A}^T\mathbf{A}$. This is the same as the last right singular vector of A, because

$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T, \tag{A.37}$$

$$\mathbf{A}^T\mathbf{A} = \mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^T, \tag{A.38}$$

$$\mathbf{A}^T\mathbf{A}\mathbf{v}_k = \sigma_k^2 \mathbf{v}_k, \tag{A.39}$$

which is minimized by selecting the smallest $\sigma_k$ value.

Figure A.2b shows a line-fitting problem where, in this case, the measurement errors are
assumed to be isotropic in (x, y). The solution for the best line equation ax + by + c = 0 is
found by minimizing

$$E_{\text{TLS-2D}} = \sum_i (a x_i + b y_i + c)^2, \tag{A.40}$$

i.e., finding the eigenvector associated with the smallest eigenvalue of⁶

$$\mathbf{C} = \mathbf{A}^T\mathbf{A} = \sum_i \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} \begin{bmatrix} x_i & y_i & 1 \end{bmatrix}. \tag{A.41}$$

Notice, however, that minimizing $\sum_i (\mathbf{a}_i \mathbf{x})^2$ in (A.35) is only statistically optimal
(Appendix B.1) if all of the measured terms in the $\mathbf{a}_i$, e.g., the $(x_i, y_i, 1)$ measurements, have
equal noise. This is definitely not the case in the line-fitting example of Figure A.2b (A.40),
as the 1 values are noise-free. To mitigate this, we first subtract the mean x and y values from
all the measured points

$$\hat{x}_i = x_i - \bar{x} \tag{A.42}$$

$$\hat{y}_i = y_i - \bar{y} \tag{A.43}$$

and then fit the 2D line equation $a(x - \bar{x}) + b(y - \bar{y}) = 0$ by minimizing

$$E_{\text{TLS-2Dm}} = \sum_i (a \hat{x}_i + b \hat{y}_i)^2. \tag{A.44}$$

The more general case where each individual measurement component can have a different
noise level, as is the case in estimating essential and fundamental matrices (Section 11.3),
is called the heteroscedastic errors-in-variables (HEIV) model and is discussed by Matei and
Meer (2006).
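
The mean-subtracted fit of (A.42) to (A.44) can be sketched as follows. The function name `fit_line_tls` and the sample points are illustrative assumptions. The example uses points on a vertical line, a case that the slope-intercept fit of (A.25) cannot represent but that total least squares handles without difficulty.

```python
import numpy as np

def fit_line_tls(x, y):
    """Fit a*x + b*y + c = 0 with a^2 + b^2 = 1 by total least squares (hypothetical helper)."""
    xm, ym = x.mean(), y.mean()
    A = np.column_stack([x - xm, y - ym])   # rows (x_hat_i, y_hat_i) from (A.42)-(A.43)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    a, b = Vt[-1]                           # last right singular vector: the line normal
    c = -(a * xm + b * ym)                  # the fitted line passes through the centroid
    return a, b, c

# Points lying exactly on the vertical line x = 3.
x = np.array([3.0, 3.0, 3.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 5.0])
a, b, c = fit_line_tls(x, y)
print(np.allclose(a * x + b * y + c, 0.0))
```

The sign of the returned normal (a, b) is arbitrary, since (a, b, c) and (−a, −b, −c) describe the same line.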


A.3 Non-linear least squares 


In many vision problems, such as structure from motion, the least squares problem formulated
in (A.23) involves functions $\mathbf{f}(\mathbf{x}_i; \mathbf{p})$ that are not linear in the unknown parameters p. This
problem is known as non-linear least squares or non-linear regression (Björck 1996; Madsen,
Nielsen, and Tingleff 2004; Nocedal and Wright 2006). It is usually solved by iteratively
re-linearizing (A.23) around the current estimate of p using the gradient derivative (Jacobian)
$\mathbf{J} = \partial \mathbf{f} / \partial \mathbf{p}$ and computing an incremental improvement Δp.

As shown in Equations (8.13–8.17), this results in

$$E_{\text{NLS}}(\Delta\mathbf{p}) = \sum_i \|\mathbf{f}(\mathbf{x}_i; \mathbf{p} + \Delta\mathbf{p}) - \mathbf{x}'_i\|^2 \tag{A.45}$$

$$\approx \sum_i \|\mathbf{J}(\mathbf{x}_i; \mathbf{p})\Delta\mathbf{p} - \mathbf{r}_i\|^2, \tag{A.46}$$

where the Jacobians $\mathbf{J}(\mathbf{x}_i; \mathbf{p})$ and residual vectors $\mathbf{r}_i$ play the same role in forming the normal
equations as $\mathbf{a}_i$ and $b_i$ in (A.28).

⁶ Again, be careful with the variable names here. The measurement equation is $\mathbf{a}_i = (x_i, y_i, 1)$ and the unknown
parameters are x = (a, b, c).

Because the above approximation only holds near a local minimum or for small values
of Δp, the update p ← p + Δp may not always decrease the summed square residual error
(A.45). One way to mitigate this problem is to take a smaller step,

$$\mathbf{p} \leftarrow \mathbf{p} + \alpha\Delta\mathbf{p}, \quad 0 < \alpha \leq 1. \tag{A.47}$$

A simple way to determine a reasonable value of α is to start with 1 and successively halve
the value, which is a simple form of line search (Al-Baali and Fletcher 1986; Björck 1996;
Nocedal and Wright 2006).

Another approach to ensuring a downhill step in error is to add a diagonal damping term
to the approximate Hessian

$$\mathbf{C} = \sum_i \mathbf{J}^T(\mathbf{x}_i)\mathbf{J}(\mathbf{x}_i), \tag{A.48}$$

i.e., to solve

$$[\mathbf{C} + \lambda\, \text{diag}(\mathbf{C})]\Delta\mathbf{p} = \mathbf{d}, \tag{A.49}$$

where

$$\mathbf{d} = \sum_i \mathbf{J}^T(\mathbf{x}_i)\mathbf{r}_i, \tag{A.50}$$

which is called a damped Gauss–Newton method. The damping parameter λ is increased if
the squared residual is not decreasing as fast as expected, i.e., as predicted by (A.46), and
is decreased if the expected decrease is obtained (Madsen, Nielsen, and Tingleff 2004). The
combination of the Gauss–Newton (first-order Taylor series) approximation (A.46) and the adaptive
damping parameter λ is commonly known as the Levenberg–Marquardt algorithm (Levenberg
1944; Marquardt 1963) and is an example of more general trust region methods, which
are discussed in more detail in Björck (1996), Conn, Gould, and Toint (2000), Madsen,
Nielsen, and Tingleff (2004), and Nocedal and Wright (2006).
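
The damped update (A.48) to (A.50) can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the exponential model, the synthetic noise-free data, the starting point, and in particular the damping schedule, which simply divides or multiplies λ by 10 depending on whether a step reduces the error. This is a simplification of the gain-ratio test described by Madsen, Nielsen, and Tingleff (2004).

```python
import numpy as np

def model(x, p):
    """Toy model f(x; p) = p0 * exp(p1 * x)."""
    return p[0] * np.exp(p[1] * x)

def jacobian(x, p):
    e = np.exp(p[1] * x)
    return np.column_stack([e, p[0] * x * e])        # columns: df/dp0, df/dp1

def levenberg_marquardt(x, y, p, lam=1e-3, iters=100, tol=1e-12):
    r = y - model(x, p)                              # residuals r_i
    for _ in range(iters):
        J = jacobian(x, p)
        C = J.T @ J                                  # (A.48)
        d = J.T @ r                                  # (A.50)
        dp = np.linalg.solve(C + lam * np.diag(np.diag(C)), d)   # (A.49)
        if np.linalg.norm(dp) < tol:
            break
        r_new = y - model(x, p + dp)
        if r_new @ r_new < r @ r:                    # error decreased: accept, relax damping
            p, r, lam = p + dp, r_new, lam / 10.0
        else:                                        # error increased: reject, damp harder
            lam *= 10.0
    return p

x = np.linspace(0.0, 2.0, 20)
y = model(x, np.array([1.5, -0.8]))                  # noise-free synthetic measurements
p_est = levenberg_marquardt(x, y, np.array([1.0, 0.0]))
print(p_est)
```

With a large λ the step shrinks toward a scaled gradient descent direction, while with a small λ it approaches the plain Gauss–Newton step.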

When the initial solution is far away from its quadratic region of convergence around a
local minimum, large residual methods, e.g., Newton-type methods, which add a second-order
term to the Taylor series expansion in (A.46), may converge faster. Quasi-Newton methods
such as BFGS, which require only gradient evaluations, can also be useful if memory size is
an issue. Such techniques are discussed in textbooks and papers on numerical optimization
(Toint 1987; Björck 1996; Conn, Gould, and Toint 2000; Nocedal and Wright 2006).

A.4 Direct sparse matrix techniques


Many optimization problems in computer vision, such as bundle adjustment (Szeliski and
Kang 1994; Triggs, McLauchlan et al. 1999; Hartley and Zisserman 2004; Snavely, Seitz,
and Szeliski 2008b; Agarwal, Snavely et al. 2009), have Jacobian and (approximate) Hessian
matrices that are extremely sparse (Section 11.4.3). For example, Figure 11.16a shows the
bipartite model typical of structure from motion problems, in which most points are only
observed by a subset of the cameras, which results in the sparsity patterns for the Jacobian
and Hessian shown in Figure 11.16b–c.

Whenever the Hessian matrix is sparse enough, it is more efficient to use sparse Cholesky
factorization instead of regular Cholesky factorization. In such sparse direct techniques, the
Hessian matrix C and its associated Cholesky factor R are stored in compressed form, in
which the amount of storage is proportional to the number of (potentially) non-zero entries
(Björck 1996; Davis 2006).⁷ Algorithms for computing the non-zero elements in C and R
from the sparsity pattern of the Jacobian matrix J are given by Björck (1996, Section 6.4),
and algorithms for computing the numerical Cholesky and QR decompositions (once the
sparsity pattern has been computed and storage allocated) are discussed by Björck (1996,
Section 6.5). More recent publications on direct sparse techniques which discuss supernodal
and multifrontal algorithms for large sparse systems include Davis (2006) and Davis,
Rajamanickam, and Sid-Lakhdar (2016).


A.4.1 Variable reordering 


The key to efficiently solving sparse problems using direct (non-iterative) techniques is to 
determine an efficient ordering for the variables, which reduces the amount of fill-in, i.e., the 
number of non-zero entries in R that were zero in the original C matrix. We have already 
seen in Section 11.4.3 how storing the more numerous 3D point parameters before the camera 
parameters and using the Schur complement (11.68) results in a more efficient algorithm. 
Similarly, sorting parameters by time in video-based reconstruction problems usually results 
in lower fill-in. Furthermore, any problem whose adjacency graph (the graph corresponding 
to the sparsity pattern) is a tree can be solved in linear time with an appropriate reordering of 
the variables (putting all the children before their parents). All of these are examples of good 
reordering techniques. 
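
The effect of ordering on fill-in can be demonstrated with a purely symbolic factorization, which tracks only which entries become non-zero. The sketch below is illustrative: the function name `cholesky_fill_in` and the small "arrow" test matrix are assumptions. The arrow matrix has the sparsity pattern of a star-shaped tree, in which one hub variable is coupled to all the others, so it is an instance of the tree case described above.

```python
import numpy as np

def cholesky_fill_in(pattern):
    """Count entries that are zero in the upper triangle of C but non-zero in R.

    `pattern` is a symmetric boolean matrix marking the non-zeros of C.
    Numerical cancellation is ignored (a purely symbolic factorization).
    """
    S = np.triu(pattern).astype(bool)
    before = S.sum()
    n = S.shape[0]
    for i in range(n):
        cols = np.flatnonzero(S[i, i + 1:]) + i + 1   # off-diagonal non-zeros in row i
        for j in cols:
            S[j, cols[cols >= j]] = True              # eliminating i couples all of its neighbors
    return int(S.sum() - before)

n = 6
arrow = np.eye(n, dtype=bool)
arrow[0, :] = arrow[:, 0] = True         # variable 0 (the hub) is coupled to every other variable

perm = np.arange(n)[::-1]                # reorder so that the hub variable comes last
arrow_reordered = arrow[np.ix_(perm, perm)]

print(cholesky_fill_in(arrow), cholesky_fill_in(arrow_reordered))   # 10 0
```

Eliminating the hub first couples all remaining variables and fills the factor completely, whereas eliminating the leaves (children) before the hub (parent) produces no fill-in at all.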


⁷ For example, you can store a list of $(i, j, c_{ij})$ triples. One example of such a scheme is compressed sparse
row (CSR) storage. An alternative storage method called skyline, which stores adjacent vertical spans of non-zero
elements (Bathe 2007), is sometimes used in finite element analysis. Banded systems such as snakes (7.27) can store
just the non-zero band elements (Björck 1996, Section 6.2) and can be solved in $O(nb^2)$, where n is the number of
variables and b is the bandwidth.

procedure SparseCholeskySolve(C, d):
    1. Determine symbolically the structure of C, i.e., the adjacency graph.
    2. (Optional) Compute a reordering for the variables, taking into account
       any block structure inherent in the problem.
    3. Determine the fill-in pattern for R and allocate the compressed storage
       for R as well as storage for the permuted right-hand side $\hat{\mathbf{d}}$.
    4. Copy the elements of C and d into R and $\hat{\mathbf{d}}$, permuting the values
       according to the computed ordering.
    5. Perform the numerical factorization of R using Algorithm A.1.
    6. Solve the factored system (A.33), i.e.,
       $\mathbf{R}^T\mathbf{z} = \hat{\mathbf{d}}, \quad \mathbf{R}\mathbf{x} = \mathbf{z}$.
    7. Return the solution x, after undoing the permutation.

Algorithm A.2 Sparse least squares using a sparse Cholesky decomposition of the matrix
C.


In the general case of unstructured data, there are many heuristics available to find good
reorderings (Björck 1996; Davis 2006).⁸ For general adjacency (sparsity) graphs, minimum
degree orderings generally produce good results. For planar graphs, which often arise on
image or spline grids (Section 9.2.2), nested dissection, which recursively splits the graph
into two equal halves along a frontier (or boundary) of small size, generally works well. Such
domain decomposition (or multi-frontal) techniques also enable the use of parallel processing,
as independent sub-graphs can be processed in parallel on separate processors (Davis 2011).

The overall set of steps used to perform the direct solution of sparse least squares problems
is summarized in Algorithm A.2, which is a modified version of Algorithm 6.6.1 by Björck
(1996, Section 6.6). If a series of related least squares problems is being solved, as is the
case in iterative non-linear least squares (Appendix A.3), steps 1–3 can be performed ahead of
time and reused for each new invocation with different C and d values. When the problem is
block-structured, as is the case in structure from motion where point (structure) variables have
dense 3 × 3 sub-entries in C and cameras have 6 × 6 (or larger) entries, the cost of performing
the reordering computation is small compared to the actual numerical factorization, which
can benefit from block-structured matrix operations (Golub and Van Loan 1996). It is also
possible to apply sparse reordering and multifrontal techniques to QR factorization (Davis
2011), which may be preferable when the least squares problems are poorly conditioned.

⁸ Finding the optimal reordering with minimal fill-in is provably NP-hard.


A.5 Iterative techniques 


When problems become large, the amount of memory required to store the Hessian matrix
C and its factor R, and the amount of time it takes to compute the factorization, can become
prohibitively large, especially when there are large amounts of fill-in. This is often the
case with image processing problems defined on pixel grids, because, even with the optimal
reordering (nested dissection), the amount of fill can still be large.

A preferable approach to solving such linear systems is to use iterative techniques, which 
compute a series of estimates that converge to the final solution, e.g., by taking a series of 
downhill steps in an energy function such as (A.29). 

A large number of iterative techniques have been developed over the years, including such 
well-known algorithms as successive overrelaxation and multi-grid. These are described in 
specialized textbooks on iterative solution techniques (Axelsson 1996; Saad 2003) as well as 
in more general books on numerical linear algebra and least squares techniques (Björck 1996; 
Golub and Van Loan 1996; Trefethen and Bau 1997; Nocedal and Wright 2006; Björck and 
Dahlquist 2010). 


A.5.1 Conjugate gradient 


The iterative solution technique that often performs best is conjugate gradient descent, which 
takes a series of downhill steps that are conjugate to each other with respect to the C matrix, 
i.e., any two descent directions u and v satisfy $u^T C v = 0$. In practice, conjugate gradient 
descent outperforms other kinds of gradient descent algorithm because the number of iterations 
it needs grows with the square root of the condition number of C instead of with the condition 
number itself.⁹ Shewchuk (1994) provides a nice introduction to this topic, with clear intuitive 
explanations of the reasoning behind the conjugate gradient algorithm and its performance. 

9 The condition number $\kappa(C)$ is the ratio of the largest and smallest eigenvalues of C. The actual convergence 
rate depends on the clustering of the eigenvalues, as discussed in the references cited in this section. 

Algorithm A.3 describes the conjugate gradient algorithm and its related least squares 
counterpart, which can be used when the original set of least squares linear equations is 
available in the form of Ax = b (A.28). While it is easy to convince yourself that the two 




ConjugateGradient(C, d, x_0) 

 1. $r_0 = d - C x_0$ 
 2. $p_0 = r_0$ 
 3. for $k = 0, 1, \ldots$ 
 4.     $w_k = C p_k$ 
 5.     $\alpha_k = \|r_k\|^2 / (p_k \cdot w_k)$ 
 6.     $x_{k+1} = x_k + \alpha_k p_k$ 
 7.     $r_{k+1} = r_k - \alpha_k w_k$ 
 8.     (no corresponding step) 
 9.     $\beta_{k+1} = \|r_{k+1}\|^2 / \|r_k\|^2$ 
10.     $p_{k+1} = r_{k+1} + \beta_{k+1} p_k$ 

ConjugateGradientLS(A, b, x_0) 

 1. $q_0 = b - A x_0$, $r_0 = A^T q_0$ 
 2. $p_0 = r_0$ 
 3. for $k = 0, 1, \ldots$ 
 4.     $v_k = A p_k$ 
 5.     $\alpha_k = \|r_k\|^2 / \|v_k\|^2$ 
 6.     $x_{k+1} = x_k + \alpha_k p_k$ 
 7.     $q_{k+1} = q_k - \alpha_k v_k$ 
 8.     $r_{k+1} = A^T q_{k+1}$ 
 9.     $\beta_{k+1} = \|r_{k+1}\|^2 / \|r_k\|^2$ 
10.     $p_{k+1} = r_{k+1} + \beta_{k+1} p_k$ 

Algorithm A.3 Conjugate gradient and conjugate gradient least squares algorithms. The 
algorithms are described in more detail in the text, but in brief, they choose descent directions 
$p_k$ that are conjugate to each other with respect to C by computing a factor $\beta$ by which to 
discount the previous search direction $p_{k-1}$. They then find the optimal step size $\alpha$ and take 
a downhill step by an amount $\alpha_k p_k$. 


forms are mathematically equivalent, the least squares form is preferable if rounding errors 
start to affect the results because of poor conditioning. It may also be preferable if, due to 
the sparsity structure of A, multiplies with the original A matrix are faster or more space 


efficient than multiplies with C. 


The conjugate gradient algorithm starts by computing the current residual $r_0 = d - C x_0$, 
which is the direction of steepest descent of the energy function (A.28). It sets the original 
descent direction $p_0 = r_0$. Next, it multiplies the descent direction by the quadratic form 
(Hessian) matrix C and combines this with the residual to estimate the optimal step size $\alpha_k$. 
The solution vector $x_k$ and the residual vector $r_k$ are then updated using this step size. (Notice 
how the least squares variant of the conjugate gradient algorithm splits the multiplication by 
the $C = A^T A$ matrix across steps 4 and 8.) Finally, a new search direction is calculated 
by first computing a factor $\beta$ as the ratio of current to previous squared residual magnitudes. The 
new search direction $p_{k+1}$ is then set to the residual plus $\beta$ times the old search direction $p_k$, 
which keeps the directions conjugate with respect to C. 
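
The two columns of Algorithm A.3 can be sketched in NumPy as follows. This is a minimal illustration rather than a production solver: the function names, the iteration cap `num_iters`, and the stopping tolerance `tol` are assumptions added here, and dense matrices stand in for the sparse ones used in practice.

```python
import numpy as np

def conjugate_gradient(C, d, x0, num_iters=50, tol=1e-10):
    """Solve C x = d for symmetric positive definite C (left column of Algorithm A.3)."""
    x = np.array(x0, dtype=float)
    r = d - C @ x                    # step 1: residual (steepest descent direction)
    p = r.copy()                     # step 2: first search direction
    for _ in range(num_iters):       # step 3
        rr = r @ r
        if np.sqrt(rr) < tol:        # assumed stopping rule, not part of Algorithm A.3
            break
        w = C @ p                    # step 4
        alpha = rr / (p @ w)         # step 5: optimal step size
        x = x + alpha * p            # step 6
        r = r - alpha * w            # step 7
        beta = (r @ r) / rr          # step 9
        p = r + beta * p             # step 10: new conjugate direction
    return x

def conjugate_gradient_ls(A, b, x0, num_iters=50, tol=1e-10):
    """Least squares variant (right column): never forms C = A^T A explicitly."""
    x = np.array(x0, dtype=float)
    q = b - A @ x                    # step 1
    r = A.T @ q
    p = r.copy()                     # step 2
    for _ in range(num_iters):       # step 3
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        v = A @ p                    # step 4: first half of the multiply by C
        alpha = rr / (v @ v)         # step 5
        x = x + alpha * p            # step 6
        q = q - alpha * v            # step 7
        r = A.T @ q                  # step 8: second half of the multiply by C
        beta = (r @ r) / rr          # step 9
        p = r + beta * p             # step 10
    return x

# Small synthetic problem (hypothetical data) to compare both variants.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x_cg = conjugate_gradient(A.T @ A, A.T @ b, np.zeros(5))
x_ls = conjugate_gradient_ls(A, b, np.zeros(5))
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
```

On this well-conditioned example both variants should agree with the reference least squares solution; the difference between them only becomes visible when C is poorly conditioned or when multiplying by A is cheaper than multiplying by C.
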

It turns out that conjugate gradient descent can also be directly applied to non-quadratic 
energy functions, e.g., those arising from non-linear least squares (Appendix A.3). Instead 
of explicitly forming a local quadratic approximation C and then computing residuals $r_k$, 
non-linear conjugate gradient descent computes the gradient of the energy function E (A.45) 
directly inside each iteration and uses it to set the search direction (Nocedal and Wright 
2006). Because the quadratic approximation to the energy function may not exist or may be 
inaccurate, line search is often used to determine the step size $\alpha_k$. Furthermore, to compensate 
for errors in finding the true function minimum, alternative formulas for $\beta_{k+1}$, such as 
Polak–Ribière, 

$$\beta_{k+1} = \frac{\nabla E(x_{k+1}) \cdot [\nabla E(x_{k+1}) - \nabla E(x_k)]}{\|\nabla E(x_k)\|^2}$$    (A.51) 

are often used (Nocedal and Wright 2006). 


A.5.2 Preconditioning 


As we mentioned previously, the rate of convergence of the conjugate gradient algorithm 
is governed in large part by the condition number $\kappa(C)$. Its effectiveness can therefore be 
increased dramatically by reducing this number, e.g., by rescaling elements in x, which corresponds 
to rescaling rows and columns in C. 

Preconditioning is usually thought of as a change of basis from the vector x to 
a new vector 

$$\hat{x} = S x.$$    (A.52) 

The corresponding linear system being solved then becomes 

$$A S^{-1} \hat{x} = b \quad \text{or} \quad \hat{A} \hat{x} = b,$$    (A.53) 

where $\hat{A} = A S^{-1}$, with a corresponding least squares energy (A.29) of the form 

$$E_{\text{PLS}} = \hat{x}^T (S^{-T} C S^{-1}) \hat{x} - 2 \hat{x}^T (S^{-T} d) + \|b\|^2.$$    (A.54) 

The actual preconditioned matrix $\hat{C} = S^{-T} C S^{-1}$ is usually not explicitly computed. Instead, 
Algorithm A.3 is extended to insert $S^{-T}$ and $S^{-1}$ operations at the appropriate places 
(Björck 1996; Golub and Van Loan 1996; Trefethen and Bau 1997; Saad 2003; Nocedal and 
Wright 2006). 

A good preconditioner S is easy and cheap to compute, but is also a decent approximation 
to a square root of C, so that $\kappa(S^{-T} C S^{-1})$ is closer to 1. The simplest such choice is the 
square root of the diagonal matrix, $S = D^{1/2}$, with $D = \mathrm{diag}(C)$. This has the advantage 
that any scalar change in variables (e.g., using radians instead of degrees for angular measurements) 
has no effect on the rate of convergence of the iterative technique. For problems that 
are naturally block-structured, e.g., for structure from motion, where 3D point positions or 
6D camera poses are being estimated, a block diagonal preconditioner is often a good choice. 
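
The effect of the diagonal preconditioner can be checked numerically. The sketch below uses a made-up, badly scaled least squares problem (the column scales and the unit-change matrix `T` are arbitrary assumptions) and forms the preconditioned matrix explicitly, which is only done here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical problem whose unknowns are expressed in very different units.
A = rng.standard_normal((50, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
C = A.T @ A

# Diagonal preconditioner S = D^{1/2} with D = diag(C).
# S^{-T} C S^{-1} then has a unit diagonal.
s = np.sqrt(np.diag(C))
C_hat = C / np.outer(s, s)

cond_before = np.linalg.cond(C)
cond_after = np.linalg.cond(C_hat)

# An arbitrary change of units x -> T^{-1} x turns C into T C T,
# but leaves the diagonally preconditioned matrix unchanged.
T = np.diag([1.0, 0.1, 57.3, 2.0])
C2 = T @ C @ T
s2 = np.sqrt(np.diag(C2))
C_hat2 = C2 / np.outer(s2, s2)
```

Comparing `cond_before` with `cond_after` shows how much of the poor conditioning was due to scaling alone, and `C_hat2` matching `C_hat` illustrates the invariance to a change of units.
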

A wide variety of more sophisticated preconditioners have been developed over the years 
(Björck 1996; Golub and Van Loan 1996; Trefethen and Bau 1997; Saad 2003; Nocedal and 
Wright 2006), many of which can be directly applied to problems in computer vision (Byröd 
and Åström 2009; Agarwal, Snavely et al. 2010; Jeong, Nistér et al. 2012). Some of these are 
based on an incomplete Cholesky factorization of C, i.e., one in which the amount of fill-in in 
R is strictly limited, e.g., to just the original non-zero elements in C.¹⁰ Other preconditioners 
are based on a sparsified, e.g., tree-based or clustered, approximation to C (Koutis 2007; 
Koutis and Miller 2008; Grady 2008; Koutis, Miller, and Tolliver 2009), as these are known 
to have efficient inversion properties. 

For grid-based image-processing applications, parallel or hierarchical preconditioners 
often perform extremely well (Yserentant 1986; Szeliski 1990b; Pentland 1994; Saad 2003; 
Szeliski 2006b; Krishnan and Szeliski 2011; Krishnan, Fattal, and Szeliski 2013). These 
approaches use a change of basis transformation S that resembles the pyramidal or wavelet 
representations discussed in Section 3.5, and are hence amenable to parallel and GPU-based 
implementations (Figure 3.35b). Coarser elements in the new representation quickly con- 
verge to the low-frequency components in the solution, while finer-level elements encode 
the higher-frequency components. Some of the relationships between hierarchical preconditioners, 
incomplete Cholesky factorization, and multigrid techniques are explored by Saad 
(2003), Szeliski (2006b), Krishnan and Szeliski (2011), and Krishnan, Fattal, and Szeliski 
(2013). 


A.5.3 Multigrid 


One other class of iterative techniques widely used in computer vision is multigrid techniques 
(Briggs, Henson, and McCormick 2000; Trottenberg, Oosterlee, and Schuller 2000), which 
have been applied to problems such as surface interpolation (Terzopoulos 1986a), optical flow 
(Terzopoulos 1986a; Bruhn, Weickert et al. 2006), high dynamic range tone mapping (Fattal, 
Lischinski, and Werman 2002), colorization (Levin, Lischinski, and Weiss 2004), natural 
image matting (Levin, Lischinski, and Weiss 2008), and segmentation (Grady 2008). 


10 If a complete Cholesky factorization $C = R^T R$ is used, we get $\hat{C} = R^{-T} C R^{-1} = I$ and all iterative 
algorithms converge in a single step, thereby obviating the need to use them, but the complete factorization is often 
too expensive. Note that incomplete factorization can also benefit from reordering. 

The main idea behind multigrid is to form coarser (lower-resolution) versions of the problems 
and use them to compute the low-frequency components of the solution. However, 
unlike simple coarse-to-fine techniques, which use the coarse solutions to initialize the fine 
solution, multigrid techniques only correct the low-frequency component of the current solu- 
tion and use multiple rounds of coarsening and refinement (in what are often called “V” and 
“W” patterns of motion across the pyramid) to obtain rapid convergence. 

On certain simple homogeneous problems (such as solving Poisson equations), multigrid 
techniques can achieve optimal performance, i.e., computation times linear in the number 
of variables. However, for more inhomogeneous problems or problems on irregular grids, 
variants on these techniques, such as algebraic multigrid (AMG) approaches, which look at 
the structure of C to derive coarse level problems, may be preferable. Saad (2003) has a 
nice discussion of the relationship between multigrid and parallel preconditioners and on the 
relative merits of using multigrid or conjugate gradient approaches. 

As you may have noticed, the following problem commonly recurs in computer vision applications. 
Given a number of measurements (images, feature positions, etc.), estimate the 
values of some unknown structure or parameters (camera positions, object shape, etc.). These 
kinds of problems are in general called inverse problems because they involve estimating unknown 
model parameters instead of simulating the forward formation equations.¹ Computer 
graphics is a classic forward modeling problem (given some objects, cameras, and lighting, 
simulate the images that would result), while computer vision problems are usually of the 
inverse kind (given one or more images, recover the scene that gave rise to these images). 

Given an instance of an inverse problem, there are, in general, several ways to proceed. 
For instance, through clever (or sometimes straightforward) algebraic manipulation, a closed 
form solution for the unknowns can sometimes be derived. Consider, for example, the camera 
matrix calibration problem (Section 11.2.1): given an image of a calibration pattern consist- 
ing of known 3D point positions, compute the 3 x 4 camera matrix P that maps these points 
onto the image plane. 


In more detail, we can write this problem as (11.11–11.12) 

$$x_i = \frac{p_{00} X_i + p_{01} Y_i + p_{02} Z_i + p_{03}}{p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}}$$    (B.1) 

$$y_i = \frac{p_{10} X_i + p_{11} Y_i + p_{12} Z_i + p_{13}}{p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}},$$    (B.2) 

where $(x_i, y_i)$ is the feature position of the ith point measured in the image plane, $(X_i, Y_i, Z_i)$ 
is the corresponding 3D point position, and the $p_{ij}$ are the unknown entries of the camera 
matrix P. Moving the denominator over to the left-hand side, we end up with a set of simultaneous 
linear equations, 

$$x_i (p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}) = p_{00} X_i + p_{01} Y_i + p_{02} Z_i + p_{03},$$    (B.3) 

$$y_i (p_{20} X_i + p_{21} Y_i + p_{22} Z_i + p_{23}) = p_{10} X_i + p_{11} Y_i + p_{12} Z_i + p_{13},$$    (B.4) 


which we can solve using linear least squares (Appendix A.2) to obtain an estimate of P. 
The question then arises: Is this set of equations the right ones to be solving? If the 
measurements are totally noise-free or we do not care about getting the best possible answer, 
then the answer is yes. However, in general, we cannot be sure that we have a reasonable 
algorithm unless we make a model of the likely sources of error and devise an algorithm that 
performs as well as possible given these potential errors. 
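
As a concrete illustration of the linear system (B.3–B.4), the following sketch stacks two equations per point and solves for the 12 entries of P. Because the equations are homogeneous, the overall scale of P has to be fixed somehow; taking the unit-norm singular vector with the smallest singular value is one common choice and is an assumption here, as are the function names and the synthetic camera.

```python
import numpy as np

def estimate_camera_matrix(X, x):
    """X: (n, 3) known 3D points, x: (n, 2) measured image points.
    Returns a 3x4 camera matrix P, defined only up to scale."""
    n = X.shape[0]
    Xh = np.hstack([X, np.ones((n, 1))])      # homogeneous 3D points, shape (n, 4)
    M = np.zeros((2 * n, 12))
    M[0::2, 0:4] = Xh                         # p_0j terms on the right of (B.3)
    M[0::2, 8:12] = -x[:, 0:1] * Xh           # x_i times the p_2j terms
    M[1::2, 4:8] = Xh                         # p_1j terms on the right of (B.4)
    M[1::2, 8:12] = -x[:, 1:2] * Xh           # y_i times the p_2j terms
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1].reshape(3, 4)               # minimizes ||M p|| subject to ||p|| = 1

def project(P, X):
    """Apply the perspective projection (B.1-B.2)."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])
    xh = Xh @ P.T
    return xh[:, :2] / xh[:, 2:3]

# Noise-free synthetic calibration data (hypothetical camera and points).
rng = np.random.default_rng(2)
P_true = np.array([[800.0, 0.0, 320.0, 10.0],
                   [0.0, 800.0, 240.0, 20.0],
                   [0.0, 0.0, 1.0, 5.0]])
X = rng.uniform(-1.0, 1.0, size=(20, 3))
x = project(P_true, X)
P_est = estimate_camera_matrix(X, x)
reproj_err = np.abs(project(P_est, X) - x).max()
```

With noise-free measurements the reprojection error is at the level of rounding error. With noisy measurements this algebraic solution still produces an answer, but it no longer minimizes a statistically meaningful error, which is the issue the rest of this appendix addresses.
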
In the rest of this appendix, we provide a brief tutorial on the fundamentals of Bayesian 
modeling and inference. We start with estimation theory (how to build forward models 
that account for noise) and show how to model likelihoods under Gaussian noise. We then 
show how, when the measurements are linear, these result in least squares regression. In 
Appendix B.3, we review robust estimation techniques designed to deal with measurement 
outliers (gross errors). Appendices B.4 and B.5 discuss Bayesian prior models and Markov 
random fields, which are compact local priors suitable for image processing. We also describe 
a number of widely used inference algorithms for finding good solutions to MRF problems. 
Finally, Appendix B.6 describes how we can model the posterior uncertainty in our estimates. 

1 As we saw in Chapters 4 and 5, these problems are called regression problems, because we are trying to estimate 
a continuous quantity from noisy inputs, as opposed to a discrete classification task (Bishop 2006). 


B.1 Estimation theory 


The study of inverse inference problems from noisy data is often called estimation theory 
(Gelb 1974), and its extension to problems where we explicitly choose a loss function is 
called statistical decision theory (Berger 1993; MacKay 2003; Bishop 2006; Robert 2007; 
Hastie, Tibshirani, and Friedman 2009; Murphy 2012; Deisenroth, Faisal, and Ong 2020). We 
first start by writing down the forward process that leads from our unknowns (and knowns) 
to a set of noise-corrupted measurements. We then devise an algorithm that will give us an 
estimate (or set of estimates) that is as insensitive to the noise as possible and that 
also quantifies the reliability of these estimates. In this appendix, I provide a very condensed 
overview of this topic, including an introduction to basic probability and Bayesian inference. 
Much more detailed and informative treatments can be found in the books by Bishop (2006), 
Hastie, Tibshirani, and Friedman (2009), Murphy (2012), and Deisenroth, Faisal, and Ong 
(2020). 

The perspective projection equations above are just a particular instance of a more general 
set of measurement equations, 

$$y_i = f_i(x) + n_i.$$    (B.5) 

Here, the $y_i$ are the noise-corrupted measurements, e.g., $(x_i, y_i)$ in Equations (B.1–B.2), and 
x is the unknown state vector.² 

Each measurement comes with its associated measurement model $f_i(x)$, which maps the 
unknown into that particular measurement. Note that the use of the $f_i(x)$ form makes it 
straightforward to have measurements of different dimensions, which becomes useful when 
we start adding in prior information (Appendix B.4). 

Each measurement is also contaminated with some noise $n_i$. In Equation (B.7) we specify 
that $n_i$ is a zero-mean normal (Gaussian) random variable with a covariance matrix $\Sigma_i$. In 
general, the noise need not be Gaussian and, in fact, it is usually prudent to assume that some 
measurements may be outliers. However, we defer this discussion to Appendix B.3, after we 
have explored the simpler Gaussian noise case more fully. We also assume that the noise 
vectors $n_i$ are independent. In the case where they are not (e.g., when some constant gain or 
offset contaminates all of the pixels in a given image), we can add this effect as a nuisance 
parameter to our state vector x and later estimate its value (and discard it, if so desired). 

2 In the Kalman filtering literature (Gelb 1974), it is more common to use z instead of y to denote measurements. 


Likelihood for multivariate Gaussian noise 


Given all of the noisy measurements $y = \{y_i\}$, we would like to infer a probability distribution 
on the unknown x vector. We can write the likelihood of having observed the $\{y_i\}$ given 
a particular value of x as 

$$L = p(y|x) = \prod_i p(y_i|x) = \prod_i p(y_i|f_i(x)) = \prod_i p(n_i).$$    (B.6) 

When each noise vector $n_i$ is a multivariate Gaussian with covariance $\Sigma_i$, 

$$n_i \sim N(0, \Sigma_i),$$    (B.7) 

we can write this likelihood as 

$$L = \prod_i |2\pi\Sigma_i|^{-1/2} \exp\left(-\frac{1}{2} (y_i - f_i(x))^T \Sigma_i^{-1} (y_i - f_i(x))\right) 
    = \prod_i |2\pi\Sigma_i|^{-1/2} \exp\left(-\frac{1}{2} \|y_i - f_i(x)\|^2_{\Sigma_i^{-1}}\right),$$    (B.8) 

where the matrix norm $\|x\|^2_A$ is a shorthand notation for $x^T A x$. 


The norm $\|y_i - \bar{y}_i\|_{\Sigma_i^{-1}}$ is often called the Mahalanobis distance, which we introduced 
in (5.32), and is used to measure the distance between a measurement and the mean of a 
multivariate Gaussian distribution (Bishop 2006, Section 2.3; Hartley and Zisserman 2004, 
Appendix 2). Contours of equal Mahalanobis distance are equi-probability contours (Figure 5.9). 
Note that when the measurement covariance is isotropic (the same in all directions), 
i.e., when $\Sigma_i = \sigma_i^2 I$, the likelihood can be written as 

$$L = \prod_i (2\pi\sigma_i^2)^{-N_i/2} \exp\left(-\frac{1}{2\sigma_i^2} \|y_i - f_i(x)\|^2\right),$$    (B.9) 

where $N_i$ is the length of the ith measurement vector $y_i$. 

We can more easily visualize the structure of the covariance matrix and the corresponding 
Mahalanobis distance if we first perform an eigenvalue or principal component analysis 
(PCA) of the covariance matrix (A.6), 

$$\Sigma_i = \Phi \, \mathrm{diag}(\lambda_0 \ldots \lambda_{N-1}) \, \Phi^T.$$    (B.10) 

Equal-probability contours of the corresponding multivariate Gaussian, which are also equi-distance 
contours in the Mahalanobis distance (Figure 5.19), are multi-dimensional ellipsoids 
whose axis directions are given by the columns of $\Phi$ (the eigenvectors) and whose lengths 
are given by the $\sigma_j = \sqrt{\lambda_j}$ (Figure A.1). 

It is usually more convenient to work with the negative log likelihood, which we can think 
of as a cost or energy 

$$E = -\log L = \frac{1}{2} \sum_i (y_i - f_i(x))^T \Sigma_i^{-1} (y_i - f_i(x)) + k$$    (B.11) 

$$= \frac{1}{2} \sum_i \|y_i - f_i(x)\|^2_{\Sigma_i^{-1}} + k,$$    (B.12) 

where $k = \frac{1}{2} \sum_i \log |2\pi\Sigma_i|$ is a constant that depends on the measurement variances, but is 
independent of x. 
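
The energy (B.11–B.12) is straightforward to evaluate numerically. The sketch below is illustrative only: the function name is made up, the residuals $y_i - f_i(x)$ are assumed to have been computed already, and the constant term is included with the factor of one half that follows from taking the negative logarithm of (B.8).

```python
import numpy as np

def gaussian_nll(residuals, covariances):
    """Negative log likelihood of independent Gaussian measurements.
    residuals[i] is y_i - f_i(x); covariances[i] is Sigma_i."""
    E = 0.0
    for r, Sigma in zip(residuals, covariances):
        maha2 = r @ np.linalg.solve(Sigma, r)             # squared Mahalanobis distance
        _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma)
        E += 0.5 * maha2 + 0.5 * logdet                   # data term plus constant k
    return E

# One 2D measurement whose first component is four times noisier (in variance).
Sigma = np.array([[4.0, 0.0], [0.0, 1.0]])
r = np.array([2.0, 1.0])
E = gaussian_nll([r], [Sigma])
```

Here the squared Mahalanobis distance is 2^2/4 + 1^2/1 = 2, so the residual along the noisier axis is penalized no more than the smaller residual along the more reliable axis.
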

Notice that the inverse covariance $C_i = \Sigma_i^{-1}$ plays the role of a weight on each of the 
measurement error residuals, i.e., the difference between the contaminated measurement $y_i$ 
and its uncontaminated (predicted) value $f_i(x)$. In fact, the inverse covariance is often called 
the (Fisher) information matrix (Bishop 2006), because it tells us how much information is 
contained in a given measurement, i.e., how well it constrains the final estimate. We can also 
think of this matrix as denoting the amount of confidence to associate with each measurement 
(hence the letter C). 

In this formulation, it is quite acceptable for some information matrices to be singular 
(of degenerate rank) or even zero (if the measurement is missing altogether). Rank-deficient 
measurements often occur, for example, when using a line feature or edge to measure a 3D 
edge-like feature, as its exact position along the edge is unknown (or of infinite or extremely 
large variance) (Section 9.1.3). 

To make the distinction between the noise contaminated measurement and its expected 
value for a particular setting of x more explicit, we adopt the notation $\tilde{y}_i$ for the former (think 
of the tilde as the approximate or noisy value) and $\hat{y}_i = f_i(x)$ for the latter (think of the hat as 
the predicted or expected value). We can then write the negative log likelihood as 

$$E = -\log L = \frac{1}{2} \sum_i \|\tilde{y}_i - \hat{y}_i\|^2_{C_i} + k.$$    (B.13) 


B.2 Maximum likelihood estimation and least squares 


Now that we have presented the likelihood and log likelihood functions, how can we find the 


optimal value for our state estimate x? One plausible choice might be to select the value of x 
that maximizes L = p(y|x). In fact, in the absence of any prior model for x (Appendix B.4), 
we have 

$$L = p(y|x) = p(y, x) = p(x|y).$$    (B.14) 


Therefore, choosing the value of x that maximizes the likelihood is equivalent to choosing 
the maximum of our probability density estimate for x. 

When might this be a good idea? If the data (measurements) constrain the possible values 
of x so that they all cluster tightly around one value (e.g., if the distribution p(x|y) is a 
unimodal Gaussian), the maximum likelihood estimate is the optimal one in that it is both 
unbiased and has the least possible variance. In many other cases, e.g., if a single estimate is 
all that is required, it is still often the best estimate.³ 

However, if the probability is multi-modal, i.e., the negative log likelihood has several local 
minima, much more care may be required. In particular, it might be necessary to defer certain 
decisions (such as the ultimate position of an object being tracked) until more measurements 
have been taken. The CONDENSATION algorithm presented in Section 7.3.1 is one possible 
method for modeling and updating such multi-modal distributions but is just one example 
of more general particle filtering and Markov Chain Monte Carlo (MCMC) techniques (An- 
drieu, de Freitas et al. 2003; Bishop 2006; Koller and Friedman 2009). 

Another possible way to choose the best estimate is to maximize the expected utility 
(or, conversely, to minimize the expected risk or loss) associated with obtaining the correct 


estimate, i.e., by minimizing 


$$E_{\text{loss}}(x, y) = \int l(x - z) \, p(z|y) \, dz.$$    (B.15) 


For example, if a robot wants to avoid hitting a wall at all costs, the loss function will be high 
whenever the estimate overestimates the true distance to the wall. When $l(x - z) = \delta(x - z)$, 
we obtain the maximum likelihood estimate, whereas when $l(x - z) = \|x - z\|^2$, we obtain 
the mean square error (MSE) or expected value estimate. The explicit modeling of a utility 
or loss function is what characterizes statistical decision theory (Berger 1993; MacKay 2003; 
Bishop 2006; Robert 2007; Hastie, Tibshirani, and Friedman 2009; Murphy 2012; Deisen- 
roth, Faisal, and Ong 2020) and the minimization of expected risk (in machine learning) is 
called empirical risk minimization, which we discussed in Section 5.1, Equation (5.1). 

3 According to the Gauss-Markov theorem, least squares produces the best linear unbiased estimator (BLUE) for 
a linear measurement model regardless of the actual noise distribution, assuming that the noise is zero mean and 
uncorrelated. 

How do we find the maximum likelihood estimate? If the measurement noise is Gaussian, 
we can minimize the quadratic objective function (B.13). This becomes even simpler if the 
measurement equations are linear, i.e., 

$$f_i(x) = H_i x,$$    (B.16) 

where the $H_i$ are the measurement matrices relating the unknown state variables x to the measurements $\tilde{y}_i$. 
In this case, (B.13) becomes, up to a constant scale factor and offset that do not affect the minimizer, 

$$E = \sum_i \|\tilde{y}_i - H_i x\|^2_{C_i} = \sum_i (\tilde{y}_i - H_i x)^T C_i (\tilde{y}_i - H_i x),$$    (B.17) 

which is a simple quadratic form in x that can be solved using linear least squares (Appendix A.2) 
to obtain the minimum energy (maximum likelihood) solution 

$$x = \left(\sum_i H_i^T C_i H_i\right)^{-1} \left(\sum_i H_i^T C_i \tilde{y}_i\right),$$    (B.18) 

with a corresponding posterior covariance of 

$$\Sigma = C^{-1} = \left(\sum_i H_i^T C_i H_i\right)^{-1}.$$    (B.19) 


When $H_i = I$, i.e., when we are just taking an average of covariance-weighted measurements, 
we obtain the even simpler formula 

$$x = \left(\sum_i C_i\right)^{-1} \left(\sum_i C_i \tilde{y}_i\right),$$    (B.20) 

which is a simple information-weighted mean, with a final covariance (uncertainty) of 
$\Sigma = (\sum_i C_i)^{-1}$. 
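
Equations (B.18–B.20) translate almost line for line into code. The following is a small sketch with a made-up function name and made-up measurements; it accumulates the information matrix and right-hand side and then solves the resulting normal equations.

```python
import numpy as np

def information_weighted_solve(H_list, C_list, y_list):
    """Maximum likelihood estimate (B.18) and its covariance (B.19)
    for linear measurements y_i = H_i x + n_i with information matrices C_i."""
    n = H_list[0].shape[1]
    info = np.zeros((n, n))                  # sum_i H_i^T C_i H_i
    rhs = np.zeros(n)                        # sum_i H_i^T C_i y_i
    for H, C, y in zip(H_list, C_list, y_list):
        info += H.T @ C @ H
        rhs += H.T @ C @ y
    x = np.linalg.solve(info, rhs)
    Sigma = np.linalg.inv(info)
    return x, Sigma

# Two direct measurements (H_i = I) of a 2D position, as in (B.20).
I2 = np.eye(2)
y1, C1 = np.array([0.0, 0.0]), np.diag([1.0, 4.0])
y2, C2 = np.array([2.0, 2.0]), np.diag([3.0, 4.0])
x, Sigma = information_weighted_solve([I2, I2], [C1, C2], [y1, y2])
```

In this example the first coordinate of the estimate is pulled toward the second measurement, which carries three times as much information along that axis, while the second coordinate is the plain average because both measurements are equally reliable there.
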

When the measurements are non-linear, the system must be solved iteratively using non-linear 
least squares (Appendix A.3). In this case, we can compute a Cramér–Rao lower bound 
(CRLB) on the posterior covariance using the same covariance formula as before (B.19), except 
that the Jacobians $J(x_i; p)$ from (A.46) are used instead of the measurement 
matrices $H_i$. 


B.3 Robust statistics 


In Appendix B.1, we assumed that the noise being added to each measurement (B.5) was 
multivariate Gaussian (B.7). This is an appropriate model if the noise is the result of lots of 


tiny errors being added together, e.g., from thermal noise in a silicon imager. In most cases, 
however, measurements can be contaminated with larger outliers, i.e., gross failures in the 
measurement process. Examples of such outliers include bad feature matches (Section 8.1.4), 
occlusions in stereo matching (Chapter 12), and discontinuities in an otherwise smooth image, 
depth map, or label image (Sections 4.2.1 and 4.3). 

In such cases, it makes more sense to model the measurement noise with a long-tailed 
contaminated noise model, such as a Laplacian. The negative log likelihood in this case, 
rather than being quadratic in the measurement residuals (B.12–B.17), has a slower growth 
in the penalty function to account for the increased likelihood of large errors. 

This formulation of the inference problem is called an M-estimator in the robust statistics 
literature (Huber 1981; Hampel, Ronchetti et al. 1986; Black and Rangarajan 1996; Stewart 
1999; Barron 2019) and involves applying a robust penalty function $\rho(r)$ to the residuals 

$$E_{\text{RLS}}(\Delta p) = \sum_i \rho(\|r_i\|)$$    (B.21) 


instead of squaring them. Over the years, a variety of robust loss functions have been devel- 
oped, as discussed in the above references. Recently, Barron (2019) unified a number of 
these under a two-parameter loss function, which we introduced in Section 4.1.3. This loss 


function, shown in Figure 4.7, can be written as 


$$\rho(x, \alpha, c) = \frac{|\alpha - 2|}{\alpha} \left( \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{\alpha/2} - 1 \right),$$    (B.22) 

where $\alpha$ is a shape parameter that controls the robustness of the loss and c > 0 is a scale 
parameter that controls the size of the loss's quadratic bowl near x = 0. In his paper, Barron 
(2019) discusses how both parameters can be determined at run time by maximizing the likeli- 
hood (or equivalently, minimizing the negative log-likelihood) of the given residuals, making 
such an algorithm self-tuning to a wide variety of noise levels and outlier distributions. 
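
To get a feel for (B.22), the loss can be evaluated directly. The sketch below implements only the general formula; the special values of the shape parameter at which the expression is singular (0 and 2) require taking limits and are deliberately not handled here. The function name is an assumption.

```python
import numpy as np

def barron_loss(x, alpha, c):
    """General robust loss (B.22) for shape alpha (not 0 or 2) and scale c > 0."""
    b = abs(alpha - 2.0)
    return (b / alpha) * (((x / c) ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)

# alpha = 1 gives a smoothed L1 penalty, sqrt((x/c)^2 + 1) - 1.
rho_l1 = barron_loss(3.0, 1.0, 1.0)

# alpha = -2 gives a bounded penalty, 2 (x/c)^2 / ((x/c)^2 + 4).
rho_bounded = barron_loss(3.0, -2.0, 1.0)

# Near x = 0 every shape behaves like the quadratic bowl 0.5 (x/c)^2.
rho_small = barron_loss(1e-3, 1.0, 1.0)
```

Decreasing the shape parameter makes the penalty grow more slowly for large residuals, so gross outliers have less and less effect on the total energy.
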

As we mentioned in Section 8.1.4, we can take the derivative of this function with respect 
to the unknown parameters p we are estimating and set it to 0, 

$$\sum_i \psi(\|r_i\|) \frac{\partial \|r_i\|}{\partial p} = \sum_i \frac{\psi(\|r_i\|)}{\|r_i\|} r_i^T \frac{\partial r_i}{\partial p} = 0,$$    (B.23) 

where $\psi(r) = \rho'(r)$ is the derivative of $\rho$ and is called the influence function. If we introduce a 
weight function, $w(r) = \psi(r)/r$, we observe that finding the stationary point of (B.21) using 
(B.23) is equivalent to minimizing the iteratively re-weighted least squares (IRLS) problem 

$$E_{\text{IRLS}} = \sum_i w(\|r_i\|) \|r_i\|^2,$$    (B.24) 

where the $w(\|r_i\|)$ play the same local weighting role as $C_i = \Sigma_i^{-1}$ in (B.12). Black and 
Anandan (1996) describe a variety of robust penalty functions and their corresponding influence 
and weighting functions. 

The IRLS algorithm alternates between computing the weights $w(\|r_i\|)$ and 
solving the resulting weighted least squares problem (with fixed w values). Alternative incremental 
robust least squares algorithms can be found in the work of Sawhney and Ayer (1996), 
Black and Anandan (1996), Black and Rangarajan (1996), and Baker, Gross et al. (2003) 
and textbooks and tutorials on robust statistics (Huber 1981; Hampel, Ronchetti et al. 1986; 
Rousseeuw and Leroy 1987; Stewart 1999). It is also possible to apply general optimization 
techniques (Appendix A.3) directly to the non-linear cost function given in Equation (B.24), 
which may sometimes have better convergence properties. 

Most robust penalty functions involve a scale parameter, which should typically be set to 
the variance (or standard deviation, depending on the formulation) of the non-contaminated 
(inlier) noise. Estimating such noise levels directly from the measurements or their residuals, 
however, can be problematic, as such estimates themselves become contaminated by outliers. 
The robust statistics literature contains a variety of techniques to estimate such parameters. 


One of the simplest and most effective is the median absolute deviation (MAD), 

$$\text{MAD} = \mathrm{med}_i \|r_i\|,$$    (B.25) 

which, when multiplied by 1.4 (more precisely, about 1.4826 for Gaussian inlier noise), provides 
a robust estimate of the standard deviation of the inlier noise process. 
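
The IRLS iteration and the MAD scale estimate can be combined in a few lines. The sketch below fits a line to synthetic data with gross outliers; the choice of the Huber penalty, the use of the MAD-based scale as its threshold, the fixed iteration count, and all names are assumptions made for illustration.

```python
import numpy as np

def huber_weight(r, scale):
    """w(r) = psi(r) / r for the Huber penalty: 1 inside the quadratic
    region, scale / |r| in the linear region."""
    a = np.abs(r)
    return np.where(a <= scale, 1.0, scale / np.maximum(a, 1e-12))

def irls(A, b, num_iters=20):
    """Iteratively re-weighted least squares for A x ~ b with scalar residuals."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # ordinary least squares start
    for _ in range(num_iters):
        r = A @ x - b
        sigma = 1.4826 * np.median(np.abs(r))         # MAD-based inlier scale (B.25)
        w = huber_weight(r, sigma)                    # re-compute the weights
        sw = np.sqrt(w)
        # Solve the weighted least squares problem (B.24) with the weights held fixed.
        x = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
    return x

# Line y = 2 t + 1 with small Gaussian noise and 10% gross outliers.
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 100)
y = 2.0 * t + 1.0 + 0.01 * rng.standard_normal(100)
y[::10] += 5.0
A = np.column_stack([t, np.ones_like(t)])
x_ols = np.linalg.lstsq(A, y, rcond=None)[0]
x_irls = irls(A, y)
```

The ordinary least squares intercept is pulled noticeably toward the outliers, whereas the IRLS estimate stays close to the true slope and intercept because the outliers end up with very small weights.
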

As mentioned in Section 8.1.4, it is often better to start iterative non-linear minimiza- 
tion techniques, such as IRLS, in the vicinity of a good solution by first randomly selecting 
small subsets of measurements until a good set of inliers is found. The best known of these 
techniques is RANdom SAmple Consensus (RANSAC) (Fischler and Bolles 1981), although 
even better variants such as Preemptive RANSAC (Nistér 2003), PROgressive SAmple Con- 
sensus (PROSAC) (Chum and Matas 2005), USAC (Raguram, Chum et al. 2012), and Latent 
RANSAC (Korman and Litman 2018) have since been developed. The paper by Raguram, 
Chum et al. (2012) provides a nice experimental comparison of most of these techniques. 
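
The basic random sampling idea can be sketched for the simplest case of fitting a line. This is a bare-bones illustration of plain RANSAC, not of any of the improved variants cited above; the number of trials, the inlier threshold, and the function name are arbitrary assumptions.

```python
import numpy as np

def ransac_line(t, y, num_trials=200, inlier_thresh=0.05, seed=0):
    """Fit y = a t + b by repeatedly sampling two points and counting inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(t), dtype=bool)
    for _ in range(num_trials):
        i, j = rng.choice(len(t), size=2, replace=False)   # minimal sample
        if t[i] == t[j]:
            continue
        a = (y[j] - y[i]) / (t[j] - t[i])
        b = y[i] - a * t[i]
        inliers = np.abs(a * t + b - y) < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit with least squares on the largest consensus set found.
    A = np.column_stack([t[best_inliers], np.ones(best_inliers.sum())])
    a, b = np.linalg.lstsq(A, y[best_inliers], rcond=None)[0]
    return a, b, best_inliers

# Line y = 2 t + 1 with 30% gross outliers.
rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 100)
y = 2.0 * t + 1.0 + 0.01 * rng.standard_normal(100)
outlier_idx = rng.choice(100, size=30, replace=False)
y[outlier_idx] += rng.uniform(1.0, 5.0, size=30)
a, b, inliers = ransac_line(t, y)
```

The resulting inlier set can then serve as the starting point for an iterative technique such as IRLS.
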

Additional variants on RANSAC include MLESAC (Torr and Zisserman 2000), DSAC 
(Brachmann, Krull et al. 2017), Graph-Cut RANSAC (Barath and Matas 2018), MAGSAC 
(Barath, Matas, and Noskova 2019), and ESAC (Brachmann and Rother 2019). The MAGSAC++ 
paper by Barath, Noskova et al. (2020) compares many of these variants. Yang, Antonante 
et al. (2020) claim that using a robust penalty function with a decreasing outlier parame- 
ter, i.e., graduated non-convexity (Blake and Zisserman 1987; Barron 2019), can outperform 


RANSAC in many geometric correspondence and pose estimation problems. 

B.4 Prior models and Bayesian inference 


While maximum likelihood estimation can often lead to good solutions, in some cases the 
range of possible solutions consistent with the measurements is too large to be useful. For 
example, consider the problem of image denoising (Section 3.4.2). If we estimate each pixel 
separately based on just its noisy version, we cannot make any progress, as there are a large 
number of values that could lead to each noisy measurement.⁴ Instead, we need to rely on 
typical properties of images, e.g., that they tend to be piecewise smooth (Section 4.2.1). 

The propensity of images to be piecewise smooth can be encoded in a prior distribution 
p(x), which measures the likelihood of an image being a natural image. Statistical models 
where we construct or estimate a prior distribution over the unknowns we are trying to re- 
cover are known as generative models. As the prior distribution is known, we can generate 
random samples and see if they conform to our expected appearance or distribution, although 
sometimes the sampling process may itself involve a lot of computation. For example, to 
encode piecewise smoothness, we can use a Markov random field model (4.38 and B.29) 
whose negative log likelihood is proportional to a robustified measure of image smoothness 
(gradient magnitudes). 

Prior models need not be restricted to image processing applications. For example, we 
may have some external knowledge about the rough dimensions of an object being scanned, 
the focal length of a lens being calibrated, or the likelihood that a particular object might 
appear in an image. All of these are examples of prior distributions or probabilities and they 
can be used to produce more reliable estimates. 

As we have already seen in (4.33), Bayes' rule states that a posterior distribution p(x|y) 
over the unknowns x given the measurements y can be obtained by multiplying the measurement 
likelihood p(y|x) by the prior distribution p(x) and normalizing, 

$$p(x|y) = \frac{p(y|x) \, p(x)}{p(y)},$$    (B.26) 

where $p(y) = \int_x p(y|x) \, p(x)$ is a normalizing constant used to make the p(x|y) distribution 
proper (integrate to 1). Taking the negative logarithm of both sides of Equation (B.26), we 
get 

$$-\log p(x|y) = -\log p(y|x) - \log p(x) + \log p(y),$$    (B.27) 

which is the negative posterior log likelihood. It is common to drop the constant log p(y) because 
its value does not matter during energy minimization. However, if the prior distribution 
p(x) depends on some unknown parameters, we may wish to keep log p(y) in order to compute 
the most likely value of these parameters using Occam's razor, i.e., by maximizing the 
likelihood of the observations, or to select the correct number of free parameters using model 
selection (Torr 2002; Bishop 2006; Robert 2007; Hastie, Tibshirani, and Friedman 2009). 

4 In fact, the maximum likelihood estimate is just the noisy image itself. 

To find the most likely (maximum a posteriori or MAP) solution x given some measurements
y, we simply minimize this negative log likelihood, which can also be thought of as an
energy,

E(x, y) = E_d(x, y) + E_p(x).    (B.28)

The first term E_d(x, y) is the data energy or data penalty and measures the negative log
likelihood that the measurements y were observed given the unknown state x. The second
term E_p(x) is the prior energy and it plays a role analogous to the smoothness energy in
regularization. Note that the MAP estimate may not always be desirable, because it selects
the “peak” in the posterior distribution rather than some more stable statistic such as the
minimum mean squared error (MSE) estimate (see the discussion in Appendix B.2 about loss
functions and decision theory).
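
As a minimal sketch of the energy in (B.28), assume a 1D signal corrupted by i.i.d. Gaussian noise and a quadratic (non-robust) first-difference smoothness prior. Both terms are then quadratic, so the MAP estimate is the solution of a linear system. The function names and the values of `sigma` and `lam` are illustrative assumptions, not notation from the text.

```python
import numpy as np

def map_denoise_1d(y, sigma=0.1, lam=50.0):
    """MAP estimate for E(x, y) = E_d(x, y) + E_p(x) with quadratic terms.

    E_d = sum_i (x_i - y_i)^2 / (2 sigma^2)     (Gaussian measurement noise)
    E_p = (lam / 2) * sum_i (x_{i+1} - x_i)^2   (quadratic smoothness prior)
    """
    n = len(y)
    # First-difference operator D, shape (n-1, n): (D x)_i = x_{i+1} - x_i.
    D = np.diff(np.eye(n), axis=0)
    # Zero gradient of E gives (I / sigma^2 + lam D^T D) x = y / sigma^2.
    A = np.eye(n) / sigma**2 + lam * D.T @ D
    return np.linalg.solve(A, y / sigma**2)

def energy(x, y, sigma=0.1, lam=50.0):
    """Evaluate E(x, y) for the same data and prior terms."""
    e_d = np.sum((x - y) ** 2) / (2 * sigma**2)
    e_p = 0.5 * lam * np.sum(np.diff(x) ** 2)
    return e_d + e_p

rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(20), np.ones(20)])  # piecewise-constant signal
noisy = clean + 0.1 * rng.standard_normal(clean.size)
x_map = map_denoise_1d(noisy)
```

Setting `lam=0.0` removes the prior, and the solution collapses to the noisy input, which is the maximum likelihood estimate. A robust (non-quadratic) prior would preserve the step edge better but would no longer reduce to a single linear solve.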


B.5 Markov random fields 


Markov random fields (Blake, Kohli, and Rother 2011) are the most popular types of prior 
model for gridded image-like data, which include not only regular natural images (Sec- 
tion 4.3) but also two-dimensional fields such as optical flow (Chapter 9) or depth maps 
(Chapter 12), as well as binary fields, such as segmentations (Section 4.3.2).5 

As we discussed in Section 4.3, the prior probability p(x) for a Markov random field is
a Gibbs or Boltzmann distribution, whose negative log likelihood (according to the
Hammersley-Clifford theorem) can be written as a sum of pairwise interaction potentials,

E_p(x) = \sum_{\{(i,j),(k,l)\} \in N(i,j)} V_{i,j,k,l}(f(i,j), f(k,l)),    (B.29)

where N(i, j) denotes the neighbors of pixel (i, j). In the more general case, MRFs can also 
contain unary potentials, as well as higher-order potentials defined over larger cardinality 
cliques (Kindermann and Snell 1980; Geman and Geman 1984; Bishop 2006; Potetz and Lee 
2008; Kohli, Kumar, and Torr 2009; Kohli, Ladicky, and Torr 2009; Rother, Kohli et al. 2009; 
Alahari, Kohli, and Torr 2010). They can also contain line processes, i.e., additional binary 
variables that mediate discontinuities between adjacent elements (Geman and Geman 1984). 
Black and Rangarajan (1996) show how independent line process variables can be eliminated
and incorporated into regular MRFs using robust pairwise penalty functions.


5 Alternative formulations include power spectra (Section 3.4.1) and non-local means (Buades, Coll, and Morel
2008). Many people would argue that deep neural networks provide learned priors over the output distributions,
although these are not strictly Bayesian priors that can be additively combined with measurements in a log likelihood
domain.
The most commonly used neighborhood in Markov random field modeling is the N4
neighborhood, where each pixel in the field f(i, j) interacts only with its immediate
neighbors; Figure 4.12 shows such an N4 MRF. The s_x(i, j) and s_y(i, j) black boxes denote
arbitrary interaction potentials between adjacent nodes in the random field and the w(i, j) denote
the elemental data penalty terms in E_d (B.28). These square nodes can also be interpreted
as factors in a factor graph version of the undirected graphical model (Bishop 2006;
Wainwright and Jordan 2008; Koller and Friedman 2009; Dellaert and Kaess 2017; Dellaert 2021),
which is another name for interaction potentials. (Strictly speaking, the factors are improper
probability functions whose product is the un-normalized posterior distribution.)
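
To make the N4 structure concrete, the following sketch evaluates the total energy of a labeling on a 2D grid. It assumes a quadratic data penalty for w(i, j) and a truncated quadratic for the horizontal and vertical interaction potentials; these specific potentials, the function name, and the parameters `lam` and `tau` are illustrative choices rather than definitions from the text.

```python
import numpy as np

def mrf_energy_n4(f, d, lam=1.0, tau=1.0):
    """Energy of a labeling f on an N4 grid, given observations d.

    Data term: w(i, j) = (f(i, j) - d(i, j))^2 at every pixel.
    Smoothness: truncated quadratic min(diff^2, tau) on each horizontal
    (s_x) and vertical (s_y) neighbor pair, scaled by lam.
    """
    f = np.asarray(f, dtype=float)
    d = np.asarray(d, dtype=float)
    w = (f - d) ** 2
    s_x = np.minimum((f[:, 1:] - f[:, :-1]) ** 2, tau)  # pairs (i, j), (i, j+1)
    s_y = np.minimum((f[1:, :] - f[:-1, :]) ** 2, tau)  # pairs (i, j), (i+1, j)
    return w.sum() + lam * (s_x.sum() + s_y.sum())

d = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 5, 1]], dtype=float)
f_flat = np.zeros_like(d)

e_copy = mrf_energy_n4(d, d)       # zero data cost, pays for every edge
e_flat = mrf_energy_n4(f_flat, d)  # zero smoothness cost, pays data cost
```

Because the pairwise penalty is truncated at `tau`, the outlier value 5 costs no more per edge than the unit step, which is the robustness property that line processes and robust penalties are meant to provide.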

More complex and higher-dimensional interaction models and neighborhoods are also 
possible. For example, 2D grids can be enhanced with the addition of diagonal connections 
(an N8 neighborhood) or even larger numbers of pairwise terms (Boykov and Kolmogorov 
2003; Rother, Kolmogorov et al. 2007). 3D grids can be used to compute globally opti- 
mal segmentations in 3D volumetric medical images (Boykov and Funka-Lea 2006) (Sec- 
tion 6.4.1). Higher-order cliques can also be used to develop more sophisticated models 
(Potetz and Lee 2008; Kohli, Ladicky, and Torr 2009; Kohli, Kumar, and Torr 2009). 

One of the biggest challenges in using MRF models is to develop efficient inference algo- 
rithms that will find low-energy solutions (Veksler 1999; Boykov, Veksler, and Zabih 2001; 
Kohli 2007; Kumar 2008). Over the years, a large variety of such algorithms have been de- 
veloped, including simulated annealing, graph cuts, and loopy belief propagation. The choice 
of inference technique can greatly affect the overall performance of a vision system. For ex- 
ample, most of the top-performing algorithms on the Middlebury Stereo Evaluation page use 
either belief propagation or graph cuts. 

The first edition of this book (Szeliski 2010, Appendix B.5) had more detailed expla- 
nations of the most widely used MRF inference techniques, including gradient descent and 
simulated annealing, dynamic programming, belief propagation, graph cuts, and linear pro- 
gramming, which are a subset of the methods evaluated by Kappes, Andres et al. (2015) and 
shown in Figure B.1. However, since MRFs have now largely been replaced with deep neural 
networks in most applications, I have omitted these descriptions from this new edition. In- 
stead, interested readers should look in the first edition and also the book on advanced MRF 
techniques by Blake, Kohli, and Rother (2011). Experimental comparisons, along with test 
datasets and reference software, can be found in the papers by Szeliski, Zabih et al. (2008)6
and Kappes, Andres et al. (2015).7


6 https://vision.middlebury.edu/MRF.
7 http://hciweb2.iwr.uni-heidelberg.de/opengm
Figure B.1 Schematic taxonomy of the inference methods evaluated in the benchmark study 
by Kappes, Andres et al. (2015) © 2015 Springer. 


B.6 Uncertainty estimation (error analysis)


In addition to computing the most likely estimate, many applications require an estimate for
the uncertainty in this estimate.8 The most general way to do this is to compute a complete
probability distribution over all of the unknowns, but this is generally intractable. The
one special case where it is easy to obtain a simple description for this distribution is linear
estimation problems with Gaussian noise, where the joint energy function (negative log
likelihood of the posterior estimate) is a quadratic. In this case, the posterior distribution is a
multivariate Gaussian and its covariance Σ can be computed directly from the inverse of the
noise-weighted problem Hessian, as shown in (B.19). (Another name for the inverse covariance
matrix, which is equal to the Hessian in such simple cases, is the information matrix.)


Even here, however, the full covariance matrix may be too large to compute and store. For 
example, in large structure from motion problems, a large sparse Hessian normally results in a 
full dense covariance matrix. In such cases, it is often considered acceptable to report only the 
variance in the estimated quantities or simple covariance estimates on individual parameters, 
such as 3D point positions or camera pose estimates (Szeliski 1990a). More insight into the 
problem, e.g., the dominant modes of uncertainty, can be obtained using eigenvalue analysis 
(Szeliski and Kang 1997). 
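
As a small worked example of these ideas, the sketch below fits a line a*t + b to measurements with known Gaussian noise, forms the information matrix (Hessian), inverts it to get the covariance, and inspects the marginal uncertainties and the dominant eigenmode. The problem setup and variable names are assumptions chosen for illustration; for large sparse problems the explicit inverse used here would not be practical.

```python
import numpy as np

t = np.linspace(10.0, 11.0, 6)               # sample locations far from the origin
J = np.stack([t, np.ones_like(t)], axis=1)   # Jacobian of a*t + b w.r.t. (a, b)
sigma = 0.1                                  # assumed measurement noise std. dev.

H = J.T @ J / sigma**2                       # information matrix (Hessian)
cov = np.linalg.inv(H)                       # posterior covariance of (a, b)
marginal_std = np.sqrt(np.diag(cov))         # per-parameter uncertainties

# Eigenvectors of the covariance give the principal uncertainty directions;
# eigh returns eigenvalues in ascending order, so the last is the dominant mode.
evals, evecs = np.linalg.eigh(cov)
dominant_mode = evecs[:, -1]
```

Because all samples lie far from t = 0, the slope and intercept are strongly anti-correlated (`cov[0, 1]` is negative), something the marginal variances alone do not reveal but the eigenvalue analysis does.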

For problems where the posterior energy is non-quadratic, e.g., in non-linear or robustified 
least squares, it is still often possible to obtain an estimate of the Hessian in the vicinity of the 
optimal solution. In this case, the Cramer-Rao lower bound on the uncertainty (covariance) 
can be computed as the inverse of the Hessian. Another way of saying this is that while the 
local Hessian can underestimate how “wide” the energy function can be, the covariance can 
never be smaller than the estimate based on this local quadratic approximation. It is also 
possible to estimate a different kind of uncertainty (min-marginal energies) in general MRFs 
where the MAP inference is performed using graph cuts (Kohli and Torr 2008). 

While many computer vision applications ignore uncertainty modeling, it is often useful 
to compute these estimates just to get an intuitive feeling for the reliability of the estimates. 
Certain applications, such as Kalman filtering, require the computation of this uncertainty 
(either explicitly as posterior covariances or implicitly as inverse covariances) to optimally 
integrate new measurements with previously computed estimates (Dickmanns and Graefe 
1988; Matthies, Kanade, and Szeliski 1989; Szeliski 1989). 
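
The role of inverse covariances in integrating measurements can be illustrated with a sketch of the static case: two independent Gaussian estimates of the same quantity are fused by adding their information matrices. This corresponds to a Kalman measurement update with an identity measurement matrix and no dynamics; the function name and the numbers are hypothetical.

```python
import numpy as np

def fuse(x1, cov1, x2, cov2):
    """Fuse two independent Gaussian estimates of the same quantity."""
    info1, info2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    cov = np.linalg.inv(info1 + info2)      # information matrices add
    x = cov @ (info1 @ x1 + info2 @ x2)     # information-weighted mean
    return x, cov

x_prev, cov_prev = np.array([1.0, 2.0]), np.diag([4.0, 1.0])
x_new, cov_new = np.array([3.0, 2.0]), np.diag([1.0, 1.0])
x_fused, cov_fused = fuse(x_prev, cov_prev, x_new, cov_new)
```

The fused estimate is pulled toward whichever input has the smaller variance in each component, and the fused covariance is never larger than either input covariance.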


8 This is particularly true of classic photogrammetry applications, where the reporting of precision is almost
always considered mandatory (Förstner 2005).
In this final appendix, I summarize some of the supplementary materials that may be useful
to students, instructors, and researchers. The book’s website at https://szeliski.org/Book
contains updated lists of related courses, so please check there as well.


C.1 Datasets and benchmarks 


As I mentioned in the introduction, one of the keys to developing reliable vision algorithms 
is to test your procedures on challenging and representative datasets. When ground truth or 
other people’s results are available, such tests can be even more informative (and quantitative). 

Over the years, a large number of datasets have been developed for testing and evaluating 
computer vision algorithms, e.g., Middlebury stereo (Scharstein and Szeliski 2002), PASCAL 
(Everingham, Van Gool et al. 2010), ImageNet (Russakovsky, Deng et al. 2015), KITTI 
(Geiger, Lenz, and Urtasun 2012), Sintel (Butler, Wulff et al. 2012), and COCO (Lin, Maire 
et al. 2014). 

Many of these datasets come with associated benchmarks where the results (and often 
pointers to code) for the latest algorithms can be found. I have already mentioned (and in 
some cases tabulated) many of these datasets in previous chapters of the book. In this ap- 
pendix, I provide a summary of these datasets. You can also find older, less frequently used 
datasets in the first edition of this book (Szeliski 2010, Appendix C.1) and an up-to-date list 
on VisionBib.Com (http://datasets.visionbib.com), which has been curated and maintained by 
Keith Price since 1994. 

Below, I list some of the more popular datasets, grouped by the book chapters to which 
they most closely correspond. 


Chapter 2: Image formation 


• CUReT: Columbia-Utrecht Reflectance and Texture Database,
https://www1.cs.columbia.edu/CAVE/software/curet (Dana, van Ginneken et al. 1999).


• Middlebury Color Datasets: registered color images taken by different cameras to
study how they transform gamuts and colors, https://vision.middlebury.edu/color/data
(Chakrabarti, Scharstein, and Zickler 2009).


Chapter 4: Model fitting and optimization 


• Middlebury test datasets for evaluating MRF minimization/inference algorithms,
https://vision.middlebury.edu/MRF/results (Szeliski, Zabih et al. 2008).


The OpenGM2 library and benchmarks for discrete factor graph models,
http://hciweb2.iwr.uni-heidelberg.de/opengm (Kappes, Andres et al. 2015).


Chapter 5: Deep learning 


Small-scale datasets suitable for training a simple CNN as a useful teaching tool:1
MNIST (LeCun, Cortes, and Burges 1998), CIFAR-10 (Krizhevsky 2009), and Fashion
MNIST (Xiao, Rasul, and Vollgraf 2017).


PyTorch TorchVision provides a great way to easily download some of the popular
computer vision datasets, https://pytorch.org/vision/stable/datasets.html. TensorFlow
also provides similar support with TensorFlow Datasets, https://www.tensorflow.org/datasets.


Widely used recognition, detection, and segmentation datasets and benchmarks, as
listed in Tables 6.1-6.4; separate datasets for other tasks such as image enhancement,
motion estimation, and stereo, are discussed in later sections.


Chapter 6: Recognition 


The face recognition and detection datasets listed in Table 6.1 and Masi, Wu et al. 
(2018). 


The Caltech pedestrian detection benchmark (Dollár, Belongie, and Perona 2010) and
person detection subtasks in datasets such as KITTI, http://www.cvlibs.net/datasets/kitti
(Geiger, Lenz, and Urtasun 2012) and Cityscapes, https://www.cityscapes-dataset.com
(Cordts, Omran et al. 2016).


Table 6.2 lists datasets and benchmarks for image classification, general object detec- 
tion, and segmentation. Two recent workshops that highlight the latest results on these 
datasets are the Robust Vision Challenge (Zendel et al. 2020) (see Table C.1) and the
COCO + LVIS Joint Recognition Challenge (Kirillov, Lin et al. 2020).


Datasets and benchmarks for fine-grained category recognition can be found at the
CVPR Workshop on Fine-Grained Visual Categorization, https://sites.google.com/view/fgvc8,
as well as in some of the papers on this topic discussed in Section 6.2.2.


Table 6.3 lists some datasets for video understanding and action recognition. 


1 See, e.g., https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html.
Table 6.4 lists some widely used datasets for vision and language research, which in- 
cludes image captioning, dense annotation, visual question answering, and visual dia- 
log. 


Chapter 7: Feature detection and matching 


The HPatches dataset and benchmark (Balntas, Lenc et al. 2020) is often used to eval- 
uate new feature detectors and descriptors. 


The Image Matching Benchmark (Jin, Mishkin et al. 2021) is also widely used and has 
associated workshops. 


Visual localization datasets such as Aachen Day-Night (Sattler, Maddern et al. 2018) 
are also often used. 


Pointers to datasets for evaluating instance retrieval algorithms can be found in Zheng, 
Yang, and Tian (2018). 


Non-semantic image segmentation (splitting an image into “reasonable pieces” without 
labeling their content) is not widely studied any more. Pointers to classic datasets such 
as the Berkeley Segmentation Dataset and Benchmark (Martin, Fowlkes et al. 2001) 
can be found in the first edition of this book (Szeliski 2010, Appendix C.1). 


Chapter 9: Motion estimation 


The Middlebury optical flow evaluation website, https://vision.middlebury.edu/flow
(Baker, Scharstein et al. 2011) continues to be used for evaluation, since it contains
a variety of short real-world sequences.


Most optical flow algorithms are evaluated on the Sintel dataset, http://sintel.is.tue.mpg.de
(Butler, Wulff et al. 2012), since it contains both training and test subsets and
an active leaderboard, although the videos are stylized computer animations.


Many algorithms also train and test on the KITTI flow benchmark (Geiger, Lenz, and 
Urtasun 2012), although it only contains videos acquired from a driving vehicle. The 
computer-generated sequences in the VIsual PERception (VIPER) benchmark (Richter, 
Hayder, and Koltun 2017) also contain driving sequences. Mayer, Ilg et al. (2018,
Table 1) tabulate widely used datasets for optical flow and depth estimation and show
some sample images in their Figure 1.
• A comparison of flow algorithm performance across different datasets (listed in Table C.1)
can be found in the Robust Vision Challenge workshop (http://www.robustvision.net).


For video object segmentation, the Densely Annotated VIdeo Segmentation (DAVIS)
dataset (Pont-Tuset, Perazzi et al. 2017) contains a set of widely used evaluation video
clips with ground-truth segmentation data. There is also a newer, larger dataset called
YouTube-VOS (Xu, Yang et al. 2018) with its own associated set of challenges and
leaderboards.


Datasets for video object tracking (VOT) and multiple object tracking (MOT) can be 
found at the associated workshops (Kristan, Leonardis et al. 2020; Dendorfer, Osep 
et al. 2021). A wider range of objects to track can be found in the Track Any Object 
(TAO) dataset by Dave, Khurana et al. (2020). 


Chapter 10: Computational photography 


The High Dynamic Range radiance maps captured by Debevec and Malik (1997) at 
https://www.debevec.org/Research/HDR are still the go-to place to find high-quality 
HDR images. 


The RealSR real-world super-resolution dataset developed by Cai, Zeng et al. (2019)
can be used to train and test SR algorithms on real imaging degradations. This dataset
forms the basis for the NTIRE challenges on real image super-resolution (Cai, Gu et al.
2019), which provide empirical comparisons of recent deep network-based algorithms.


The latest benchmark for comparing image denoising algorithms, the NTIRE 2020
Challenge on Real Image Denoising (Abdelhamed, Afifi et al. 2020), is based on a
smartphone image denoising dataset (SIDD) (Abdelhamed, Lin, and Brown 2018)
created by averaging sets of real-world noisy images.


The alpha matting evaluation website, http://alphamatting.com (Rhemann, Rother et
al. 2009) provides a standard set of test images and a leaderboard.


The video matting dataset at https://videomatting.com (Erofeev, Gitman et al. 2015)
provides stop-motion animation videos created by carefully hand-matting each frame.


Lin, Ryabtsev et al. (2021) describe a high-resolution real-time video matting system
along with two new video and image matting datasets.


The AIM 2020 Workshop and Challenges on image inpainting (Ntavelis, Romero et al.
2020a) provides datasets for evaluating such algorithms.
Chapter 11: Structure from motion and SLAM 


The Benchmark for 6DOF Object Pose (BOP) developed by Hodaň, Michel et al. 
(2018) has results from the recent challenge and workshop at https://bop.felk.cvut.cz/ 
challenges/bop-challenge-2020 and http://cmp.felk.cvut.cz/sixd/workshop_2020. 


The Long-Term Visual Localization Benchmark, https://www.visuallocalization.net, 
includes datasets such as Aachen Day-Night (Sattler, Maddern et al. 2018) and InLoc 
(Taira, Okutomi et al. 2018) along with an associated set of challenges and workshop 
held at ECCV 2020. 


The 1DSfM collection of landmark images created by Wilson and Snavely (2014)
(https://www.cs.cornell.edu/projects/1dsfm), which is an extension of the Photo Tourism
dataset created by Snavely, Seitz, and Szeliski (2008a), is widely used to test large-scale
structure from motion algorithms. The poses provided with this dataset, which were
obtained using the software in Wilson and Snavely (2014), are generally considered
as “ground truth” when testing more efficient algorithms, although they have never
been geo-registered. The ETH3D, https://www.eth3d.net (Schöps, Schönberger et al.
2017) and Tanks and Temples, https://www.tanksandtemples.org (Knapitsch, Park et
al. 2017) datasets are also occasionally used.


Some widely used benchmarks for SLAM systems include a benchmark for RGB-D
SLAM systems (Sturm, Engelhard et al. 2012), the KITTI Visual Odometry / SLAM
benchmark (Geiger, Lenz et al. 2013), the synthetic ICL-NUIM dataset (Handa, Whelan
et al. 2014), the TUM monoVO dataset (Engel, Usenko, and Cremers 2016), the EuRoC
MAV dataset (Burri, Nikolic et al. 2016), the ETH3D SLAM benchmark (Schöps,
Sattler, and Pollefeys 2019a), and the GSLAM general SLAM framework and benchmark
(Zhao, Xu et al. 2019). Many of these are surveyed and categorized in the paper
by Ye, Zhao, and Vela (2019), which was presented at the ICRA 2019 Workshop on
Dataset Generation and Benchmarking of SLAM Algorithms for Robotics and VR/AR,
https://sites.google.com/view/icra-2019-workshop/home.


Chapter 12: Depth estimation 


The most widely used datasets and benchmarks for two-frame and multi-view stereo 
are listed in Tables 12.1 and C.1. Among these, Middlebury stereo, KITTI, and ETH3D 
maintain active leaderboards tabulating the performance of two-frame stereo algo- 
rithms. For multi-view stereo, ETH3D and Tanks and Temples have leaderboards, and 
DTU is widely used and self-reported in papers. 
Stereo Flow Depth Obj. Det. Semantic Instance Panoptic
ADE20K¹ X
COCO² X X X X
Cityscapes³ X X X
ETH3D⁴ X
HD1K⁵ X
KITTI⁶ X X X X X X
MVD⁷ X X X X
Middlebury⁸ X X
MPI Sintel⁹ X X
Objects365¹⁰ X
OID¹¹ X X
rabbitai¹² X
ScanNet¹³ X X
VIPER¹⁴ X X X X X
WildDash¹⁵ X X X


1 http://sceneparsing.csail.mit.edu (Zhou, Zhao et al. 2019)
2 http://cocodataset.org (Lin, Maire et al. 2014)
3 https://www.cityscapes-dataset.com (Cordts, Omran et al. 2016)
4 https://www.eth3d.net (Schöps, Schönberger et al. 2017)
5 http://hci-benchmark.org (Kondermann, Nair et al. 2016)
6 http://www.cvlibs.net/datasets/kitti (Menze and Geiger 2015)
7 http://mapillary.com/dataset/vistas (Neuhold, Ollmann et al. 2017)
8 http://vision.middlebury.edu (Scharstein, Hirschmüller et al. 2014)
9 http://sintel.is.tue.mpg.de (Butler, Wulff et al. 2012)
10 https://www.objects365.org (Shao, Li et al. 2019)
11 https://storage.googleapis.com/openimages/web/index.html (Kuznetsova, Rom et al. 2020)
12 https://rabbitai.de/benchmark (Schilling, Gutsche et al. 2020)
13 http://kaldir.vc.in.tum.de/scannet_benchmark (Dai, Chang et al. 2017)
14 https://playing-for-benchmarks.org (Richter, Hayder, and Koltun 2017)
15 https://www.wilddash.cc (Zendel, Honauer et al. 2018)


Table C.1 The list of seven challenges (one per column) in the Robust Vision Challenge
2020 (http://www.robustvision.net) along with the datasets and benchmarks that are included
in each challenge.
• Many algorithms that train and test on the same dataset (e.g., KITTI) do not perform
as well when tested on different datasets (Zendel et al. 2020). Song, Yang et al. (2021)
discuss this issue and domain adaptation techniques that can reduce this problem.


• KeystoneDepth has a large set of rectified historical image pairs, but without ground
truth depth (Luo, Kong et al. 2020).


• For monocular depth inference, many algorithms train and test on the KITTI outdoor
driving image sequences. The MiDaS system developed by Ranftl, Lasinger et al.
(2020) federates a number of monocular depth inference datasets and also adds thousands
of stereo image pairs from 3D movies for training, validation, and testing.


Chapter 13: 3D reconstruction 


• The DiLiGenT photometric stereo dataset provides images taken under calibrated
directional lighting and objects with general reflectance along with ground truth shapes
(Shi, Mo et al. 2019). It also provides a taxonomy and evaluation of photometric stereo
methods for general non-Lambertian materials and unknown lighting.


NYU3D (Silberman, Hoiem et al. 2012) and ScanNet (Dai, Chang et al. 2017) were 
some of the early 3D indoor scene datasets used to study 3D reconstruction and range 
fusion algorithms. More recent algorithms such as Chabra, Lenssen et al. (2020) or 
Weder, Schönberger et al. (2021) use some combination of 3D Scenes (Zhou and
Koltun 2013), ICL-NUIM (Handa, Whelan et al. 2014), ShapeNet (Chang, Funkhouser
et al. 2015), and Tanks and Temples (Knapitsch, Park et al. 2017). Reviews of RGB-D
datasets can be found in Firman (2016) and Zollhöfer, Stotko et al. (2018).


Over the years, a number of 3D human body and motion datasets have been captured, 
including HumanEva (Sigal, Balan, and Black 2010), MPI FAUST (Bogo, Romero et 
al. 2014), Panoptic Studio (Joo, Simon et al. 2019), EHF (Pavlakos, Choutas et al. 
2019), AMASS (Mahmood, Ghorbani et al. 2019), and 3D Poses in the Wild (3DPW) 
(von Marcard, Henschel et al. 2018).2


In parallel with these datasets, 3D human body models and fitting algorithms have 
been developed, including SCAPE (Anguelov, Srinivasan et al. 2005), BlendSCAPE
(Hirshberg, Loper et al. 2012), SMPL (Loper, Mahmood et al. 2015), MANO (Joo, 
Simon, and Sheikh 2018), SMPL-X (Pavlakos, Choutas et al. 2019), VIBE (Kocabas, 


2 Additional datasets can be found on the MPI Perceiving Systems https://ps.is.mpg.de/code and Virtual Humans 
group https://virtualhumans.mpi-inf.mpg.de/software.html web pages. 
Athanasiou, and Black 2020), ExPose (Choutas, Pavlakos et al. 2020), STAR (Osman, 
Bolkart, and Black 2020), Learned Gradient Descent (Song, Chen, and Hilliges 2020), 
and FrankMoCap (Rong, Shiratori, and Joo 2020). These are described in more detail 
in Section 13.6.4. 


Chapter 14: Image-based rendering 


• The original Photo Tourism dataset created by Snavely, Seitz, and Szeliski (2008a) 
was extended by Wilson and Snavely (2014) to the much larger 1DSfM collection of 
landmark images at https://www.cs.cornell.edu/projects/1dsfm. 


The Stanford Light Field Archive, http://lightfield.stanford.edu (Wilburn, Joshi et al. 
2005) and the 4D Light Field Dataset, https://lightfield-analysis.uni-konstanz.de (Honauer, 
Johannsen et al. 2016) both provide high-quality light fields for research and projects. 


The Virtual Viewpoint Video multi-viewpoint video with per-frame depth maps,
https://www.microsoft.com/en-us/research/group/interactive-visual-media/#!downloads
(Zitnick, Kang et al. 2004) continues to be widely used for research into 3D and multi-view
video compression. Newer multi-view video datasets include Facebook Surround 360,
https://github.com/facebook/Surround360 (Parra Pozo, Toksvig et al. 2019) and Deep
View Video, https://augmentedperception.github.io/deepviewvideo (Broxton, Flynn et
al. 2020).


Most of the recent Neural Rendering papers discussed in Section 14.6 either provide
their own multi-view datasets or re-use datasets from previously published papers.


C.2 Software 


Since the publication of the first edition of this book, when high-quality open source computer
vision software was still scarce, there has been an explosion in such software. Most
research papers today come with an open source software implementation, often tested on
well-known datasets. The web site Papers with Code (https://paperswithcode.com) lists many of 
the latest machine learning research papers along with pointers to their implementations. 
When getting started in computer vision, many students either dive into using and ex- 
tending such code, or work through tutorials on deep learning frameworks such as PyTorch 
(https://pytorch.org/tutorials) or TensorFlow (https://www.tensorflow.org/tutorials). The Dive 
into Deep Learning book and web site (Zhang, Lipton et al. 2021) has associated Python 
Notebooks, based on the Apache MXNet machine learning framework, which can be down- 
loaded and run as students are working through the material. 

For “classic” computer vision algorithms not based on deep learning, one of the best 
sources continues to be the Open Source Computer Vision (OpenCV) library (https://opencv.org),
which was originally developed by Gary Bradski and his colleagues at Intel (Bradski
and Kaehler 2008; Kaehler and Bradski 2017). The library has more than 2500 optimized 
algorithms, which includes both classic and state-of-the-art computer vision and machine 
learning algorithms, with C++, Python, Java and MATLAB interfaces. 

For most of my research career, I did my software development in C++, since I liked 
its run-time efficiency, strong type checking, and object-oriented framework. In the last few 
years, however, I’ve shifted to Python. Having an interactive environment that does not re- 
quire re-compilation and linking is a big plus. Even better, the NumPy (https://numpy.org/) 
multidimensional array (tensor) library, when used in the right way, introduces developers to 
array-based (matrix) arithmetic and (hopefully) dissuades them from writing pixel-iteration 
loops that are slow to write and error-prone. A big advantage of writing in this fashion is that 
it maps closely to the abstractions used in the deep learning frameworks such as PyTorch and 
TensorFlow. It also often results in highly optimized code that can be run on both CPUs and 
GPUs with minimal changes.3
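
To illustrate the difference between pixel-iteration loops and array-based arithmetic, the sketch below computes the same quantity (the sum of squared horizontal and vertical differences of an image) both ways. The function names and the test image are illustrative assumptions.

```python
import numpy as np

def gradient_energy_loop(img):
    """Sum of squared horizontal and vertical differences, pixel by pixel."""
    h, w = img.shape
    total = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                total += (img[i, j + 1] - img[i, j]) ** 2
            if i + 1 < h:
                total += (img[i + 1, j] - img[i, j]) ** 2
    return total

def gradient_energy_vectorized(img):
    """Same quantity using whole-array slicing instead of explicit loops."""
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return float((dx ** 2).sum() + (dy ** 2).sum())

rng = np.random.default_rng(0)
img = rng.random((32, 48))
e_loop = gradient_energy_loop(img)
e_vec = gradient_energy_vectorized(img)
```

The vectorized version has no boundary tests to get wrong, and the same slicing expressions carry over with little change to tensors in PyTorch or TensorFlow.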

In the rest of this section, I list some additional software packages and libraries that stu- 
dents may find useful. You can also find pointers to older (currently less used) software 
packages in the first edition of this book (Szeliski 2010, Appendix C.2). 


Chapter 3: Image processing 


• Before diving into OpenCV, I would encourage you to write some simple image processing
functions in NumPy using the built-in multidimensional array notation. It’s
fine to use OpenCV for image input/output and to use Matplotlib for visualization.
There are also other high-level packages for image processing, such as scikit-image
and PIL/Pillow. A more recently developed computer vision library is MMCV
(https://openmmlab.com/codebase#MMCV).


As a warm-up exercise, before diving into machine learning but after doing the basic
PyTorch or TensorFlow tutorials, try porting your NumPy code into one of these
frameworks.


Another language that supports array-level functional programming is Halide
(https://halide-lang.org) (Ragan-Kelley, Barnes et al. 2013), which provides optimized
compilation onto a large number of targets, including CPUs, GPUs, mobile processors, and
DSPs such as the Qualcomm Hexagon.


3 See, e.g., https://cupy.dev or https://devopedia.org/numpy.


For wavelets, PyWavelets (https://pywavelets.readthedocs.io) has a nice extensive set
of variants.


I have always found it helpful to have an image viewer where I can quickly flip between 
aligned images to look for differences, which show up much better than when viewing 
images side-by-side. 
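
As a starting point for the kind of simple NumPy image processing functions suggested in the list above, here is a sketch of a 3x3 box filter and a per-pixel difference image (a rough stand-in for flipping between aligned images). The function names, the edge-replication border handling, and the test image are assumptions made for illustration.

```python
import numpy as np

def box_blur_3x3(img):
    """3x3 box filter on a 2D image, with edge replication at the borders."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    # Sum the nine shifted copies of the padded image, then normalize.
    for di in range(3):
        for dj in range(3):
            out += padded[di:di + h, dj:dj + w]
    return out / 9.0

def abs_difference(a, b):
    """Per-pixel absolute difference between two aligned images."""
    return np.abs(a.astype(float) - b.astype(float))

img = np.zeros((5, 5))
img[2, 2] = 9.0                    # single bright pixel
blurred = box_blur_3x3(img)
diff = abs_difference(img, blurred)
```

The only Python loops run over the nine filter taps, not over pixels, so the cost of the interpreter is independent of the image size.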


Chapter 4: Model fitting and optimization 


Scikit-learn (https://scikit-learn.org) implements a number of algorithms for regression, 
i.e., scattered data interpolation. 


OpenGM (http://hciweb2.iwr.uni-heidelberg.de/opengm) is a C++ template library for
discrete factor graph models and distributive operations on these models. It includes
state-of-the-art optimization and inference algorithms beyond message passing.


Chapter 5: Deep learning 


Scikit-learn (https://scikit-learn.org) includes a large number of traditional machine
learning algorithms and tutorials. Glassner (2018, Chapter 15) has a nice review of
these algorithms along with some exercises.


Over the last decade, a large number of deep learning software frameworks and programming
language extensions have been developed. The Wikipedia entry on deep
learning software lists over twenty such frameworks.4


The Dive into Deep Learning book (Zhang, Lipton et al. 2021) and associated course 
(Smola and Li 2019) use MXNet for all the examples in the text, but they have recently 
released PyTorch and TensorFlow code samples as well. Stanford’s CS231n (Li, John- 
son, and Yeung 2019) and Johnson (2020) include a lecture on the fundamentals of 
PyTorch and TensorFlow. 


Some classes also use simplified frameworks that require the students to implement 
more components, such as the Educational Framework (EDF) developed by McAllester 
(2020) and used in Geiger (2021). 


4https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software 
PyTorch (https://pytorch.org) and TensorFlow (https://www.tensorflow.org) are currently
the most widely used deep learning frameworks. Compared to NumPy, they
enable much faster numerical computing by leveraging a GPU.


Tensor Processing Units (TPUs) are specialized hardware optimized specifically for 
deep learning and can offer speed improvements over GPUs. TPUs are only available 
through Google Cloud. While they are still less popular than GPUs, many of the new 
papers using TPUs find it most effective to use JAX (https://github.com/google/jax). 


Even though deep learning frameworks provide some support for image augmentation, 
the imgaug library (https://github.com/aleju/imgaug) provides a much wider range of 
augmentation possibilities. 


VISSL (https://vissl.ai) is an extendable self-supervised learning framework written in
PyTorch. It provides many benchmarks, model implementations, and weights.


Google Colab (https://colab.research.google.com) is often used as a free cloud com- 
puting platform for the assignments in computer vision courses that can benefit from 
a GPU. It provides access to a GPU and memory to download datasets. The programming
environment uses Jupyter interactive notebooks, which makes code easy to share
and reproduce.


Kaggle (https://www.kaggle.com), a Google subsidiary, provides a platform to compete 
with your own models on many popular computer vision datasets. The vast majority of 
winning models now use deep learning, with many of the challenges providing lively
discussions about how different people attempted the problem and explored the data. 


Variants of the LeNet-5 architecture (Figure 5.33) are commonly used as the first convolutional
neural network introduced in courses and tutorials on the subject.5 Although
the MNIST dataset (LeCun, Cortes, and Burges 1998) originally used to train LeNet- 
5 is still sometimes used, it is more common to use the more challenging CIFAR-10 
(Krizhevsky 2009) or Fashion MNIST (Xiao, Rasul, and Vollgraf 2017). 


Andrej Karpathy provides a useful guide for training neural networks at
https://karpathy.github.io/2019/04/25/recipe, which may help avoid common issues.


A great way to experiment with various CNN architectures is to download pre-trained 
models from a model zoo such as the TorchVision library (https://github.com/pytorch/vision).
If you look in the torchvision/models folder, you will find implementations


5See, e.g., https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html.
of AlexNet, VGG, GoogLeNet, Inception, ResNet, DenseNet, MobileNet, and ShuffleNet,
along with other models for classification, object detection, and image segmentation.
Even more recent models can be found in the PyTorch Image Models
library (timm), https://github.com/rwightman/pytorch-image-models. Similar collections
of pre-trained models exist for other languages, e.g., https://www.tensorflow.org/lite/models
for efficient (mobile) TensorFlow models.


In addition to software frameworks and libraries, deep learning code development 
usually benefits from good visualization libraries such as TensorBoard
(https://www.tensorflow.org/tensorboard) and Visdom (https://github.com/fossasia/visdom). A great
way to get some intuition on how deep networks update the weights and carve out
a solution space during training is to play with the interactive visualization at
https://playground.tensorflow.org, as shown in Figure 5.32.6 OpenAI also recently released a
great interactive tool called Microscope (https://microscope.openai.com/models), which 
allows people to visualize the significance of every neuron in a network. 


The PyTorch3D library (https://github.com/facebookresearch/pytorch3d) provides representations
and functions to process 3D volumes and 3D meshes using deep neural
networks.


Chapter 6: Recognition 


For large-scale similarity search and clustering, the GPU-enabled Faiss library
(https://github.com/facebookresearch/faiss) developed by Johnson, Douze, and Jégou (2021)
can scale to very large datasets.


There are many open-source frameworks such as Classy Vision (https://classyvision.ai),
TensorFlow Core (https://www.tensorflow.org/tutorials/images/classification), and
MMClassification (https://openmmlab.com) for training and fine-tuning image and video
classification models. You can also upload your images to the Computer Vision Explorer
(https://vision-explorer.allenai.org) to see how well popular computer vision
models perform on them.


Open-source frameworks for training and fine-tuning object detectors include the TensorFlow
Object Detection API (https://github.com/tensorflow/models/tree/master/research/object_detection),
PyTorch’s Detectron2 (https://github.com/facebookresearch/detectron2),
and OpenMMLab’s MMDetection (https://openmmlab.com/codebase#MMDetection)
(Chen, Wang et al. 2019).


6 Additional interactive demonstrations can be found at https://cs.stanford.edu/people/karpathy/convnetjs. 
Detectron2 also includes semantic and panoptic segmentation, which can also be found
in TensorFlow Core (https://www.tensorflow.org/tutorials/images/segmentation) and many
other libraries.


OpenPose (Cao, Hidalgo et al. 2019) and DensePose (Güler, Neverova, and Kokkinos
2018) are two popular software packages for determining “stick figure” and dense
pixel-labeled 3D pose from 2D images. 


Pointers to software for more specialized tasks such as face detection and recogni- 
tion, pedestrian detection, video understanding, and vision and language can usually 
be found alongside the latest papers discussed in Chapter 6. 
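
To make the large-scale similarity search mentioned above more concrete, the following sketch shows the exhaustive nearest-neighbor baseline that libraries such as Faiss are designed to accelerate with indexing structures and GPUs. It is plain Python and does not use the Faiss API; the function name `knn_search` and the toy vectors are assumptions made purely for illustration.

```python
import heapq
import math


def knn_search(database, query, k=1):
    """Return the indices of the k database vectors closest to `query`.

    Uses Euclidean (L2) distance and an exhaustive linear scan, so the cost
    grows linearly with the database size. Results are ordered nearest first.
    """
    def distance(vector):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, query)))

    return heapq.nsmallest(
        k, range(len(database)), key=lambda i: distance(database[i])
    )


# Four 2D "descriptors" and one query vector.
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
query = [1.0, 1.0]
print(knn_search(database, query, k=2))  # [1, 3]
```

For real image descriptors with hundreds of dimensions and millions of entries, this linear scan becomes the bottleneck, which is the situation approximate and GPU-based search libraries address.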


Chapter 7: Feature detection and matching 


Implementations of many of the “classic” feature detectors and descriptors can be found 
in the OpenCV Features2D class and sub-classes. 


Implementations of newer DNN-based detectors and descriptors can be found associ- 
ated with the papers discussed in Chapter 7 and the datasets discussed in Appendix C.1. 


Chapter 9: Motion estimation 


The leaderboards (evaluation results) for the Middlebury
(https://vision.middlebury.edu/flow/eval/results/results-e1.php), Sintel
(http://sintel.is.tue.mpg.de/results), and KITTI
(http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=flow) datasets contain
pointers to the latest optical flow papers and code.


Chapter 10: Computational photography 


Pointers to papers and algorithms for a variety of computational photography tasks such 
as super-resolution, image denoising, image and video matting, and inpainting can be 
found at the benchmarks and workshops associated with these topics, as discussed in 
Chapter 10 and the list of datasets in Appendix C.1. 


Chapter 11: Structure from motion and SLAM 


OpenCV implements a number of widely used camera calibration and pose estimation
algorithms in the calib3d module, as do OpenGV (https://laurentkneip.github.io/opengv)
(Kneip and Furgale 2014) and OpenMVG (https://github.com/openMVG/openMVG)
(Moulon, Monasse et al. 2016).
You can find an experimental comparison of a number of RANSAC variants at
https://opencv.org/evaluating-opencvs-new-ransacs/.


A large number of open-source bundle adjustment algorithms designed to handle unordered
photo collections have been developed over the years, including:


– SBA: sparse bundle adjustment (https://www.ics.forth.gr/~lourakis/sba) (Lourakis
and Argyros 2009).

– Simple sparse bundle adjustment (SSBA) (https://github.com/chzach/SSBA).

– Bundler, structure from motion for unordered image collections
(https://phototour.cs.washington.edu/bundler) (Snavely, Seitz, and Szeliski 2006).

– The Ceres Solver for bundle adjustment and general non-linear least squares
(http://ceres-solver.org).

– MCBA (Multicore Bundle Adjustment) (https://grail.cs.washington.edu/projects/mcba)
(Wu, Agarwal et al. 2011).

– Visual SfM (http://ccwu.me/vsfm), which wraps a GUI around several reconstruction
algorithms (Wu, Agarwal et al. 2011; Wu 2013).

– MVE (https://www.gcc.tu-darmstadt.de/home/proj/mve), a complete SfM pipeline
with densification, meshing, and texturing (Fuhrmann, Langguth et al. 2015).

– The Theia global structure from motion library (http://www.theia-sfm.org) (Sweeney,
Höllerer, and Turk 2015).

– OpenMVG (Open Multiple View Geometry) (https://github.com/openMVG/openMVG)
(Moulon, Monasse et al. 2016).

– COLMAP (https://github.com/colmap/colmap), which includes both a large-scale
structure from motion system (Schönberger and Frahm 2016) and a multi-view
stereo pipeline (Schönberger, Zheng et al. 2016).

– Square Root Bundle Adjustment (https://vision.in.tum.de/research/vslam/rootba)
(Demmel, Sommer et al. 2021).


Among these, COLMAP appears to be the most often used today in other research
projects, e.g., for image-based rendering systems.


Popular open-source packages for Simultaneous Localization and Mapping (SLAM)
and Visual Odometry (VO or VIO) include:

– LSD-SLAM (large-scale direct SLAM) (Engel, Schöps, and Cremers 2014),

– ORB-SLAM (Mur-Artal, Montiel, and Tardós 2015) and ORB-SLAM2 (Mur-Artal
and Tardós 2017),

– SVO (semi-direct visual odometry) (Forster, Zhang et al. 2017),

– GTSAM (Dellaert and Kaess 2017; Dellaert 2021),

– DSO (direct sparse odometry) (Engel, Koltun, and Cremers 2018),

– BAD SLAM (bundle adjusted direct RGB-D SLAM) (Schöps, Sattler, and Pollefeys
2019a), and

– GSLAM (a general SLAM framework and benchmark) (Zhao, Xu et al. 2019).

There are also highly optimized SLAM/VIO libraries available on mobile platforms,
such as iOS (ARKit), Android (ARCore), and Facebook (Spark AR Studio), designed 


for easy integration into mobile augmented reality applications. 
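
Since the RANSAC variants compared at the link given earlier in this list all build on the same hypothesize-and-verify loop, a small reminder of that loop may be useful. The sketch below fits a 2D line to points contaminated by outliers. It is a deliberately minimal illustration in plain Python, not the OpenCV implementation; the function names, iteration count, inlier tolerance, and fixed random seed are all assumptions chosen for this example.

```python
import random


def fit_line(p, q):
    """Line through two points as (a, b, c) with a*x + b*y + c = 0, a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0:
        return None  # degenerate sample: the two points coincide
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)


def ransac_line(points, n_iters=200, inlier_tol=0.1, seed=0):
    """Basic RANSAC: sample minimal sets, keep the hypothesis with most inliers."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(n_iters):
        p, q = rng.sample(points, 2)  # minimal sample for a line
        line = fit_line(p, q)
        if line is None:
            continue
        a, b, c = line
        # Verification step: count points within inlier_tol of the line.
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) <= inlier_tol]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = line, inliers
    return best_line, best_inliers


# Ten points on y = 2x + 1 plus three gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(2.0, 9.0), (7.0, -3.0), (4.0, 20.0)]
line, inliers = ransac_line(points)
print(len(inliers))  # 10
```

A practical implementation would normally refit the model to all inliers with least squares and adapt the number of iterations to the observed inlier ratio, which are among the aspects in which RANSAC variants differ.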


Chapter 12: Stereo correspondence 


Open-source software for the latest stereo matching, multi-view, and monocular depth
inference algorithms usually accompanies recently published papers. Lists of the most
recent and best performing algorithms can be found on the leaderboards associated
with the most popular benchmarks such as Middlebury, KITTI, ETH3D, and Tanks
and Temples, which are discussed in Appendix C.1 and Tables 12.1 and C.1.


Both MVE (https://www.gcc.tu-darmstadt.de/home/proj/mve) (Fuhrmann, Langguth et
al. 2015) and COLMAP (https://github.com/colmap/colmap) (Schönberger, Zheng et
al. 2016) provide complete 3D reconstruction pipelines that include structure from mo-
tion, multi-view stereo densification, mesh generation, and texturing. A review of ear- 
lier packages can be found in Furukawa and Hernández (2015). 


A number of high-quality commercial photogrammetry packages such as CapturingReality,
ContextCapture, Metashape, and Pix4D, which grew out of computer vision research
labs, provide similar functionality.7


Chapter 13: 3D reconstruction 


The Scanalyze package (https://graphics.stanford.edu/software/scanalyze) developed
at the Stanford Graphics lab contains a number of algorithms for aligning, registering,
and fusing range images and 3D meshes.


7See also https://peterfalkingham.com/2020/07/10/free-and-commercial-photogrammetry-software-review-2020
and https://all3dp.com/1/best-photogrammetry-software.
Open3D (http://www.open3d.org) is a more recent package with similar registration
and volumetric merging capabilities (Zhou, Park, and Koltun 2018).


MeshLab (https://www.meshlab.net) is a widely used package for processing, editing, 
and viewing 3D triangular meshes (Cignoni, Callieri et al. 2008).


X3D is an XML-based format for representing 3D geometry and is an updated version 
of the original VRML (.wrl) format. A number of high-quality interactive viewers can
be found on the web.


The Point Cloud Library (PCL) at https://pointclouds.org is a library for point cloud 
processing and includes functions for feature detection, registration, segmentation, and
visualization.


As mentioned previously, both MVE and COLMAP have functions to generate 3D 
texture-mapped meshes (Fuhrmann, Langguth et al. 2015; Schönberger, Zheng et al.
2016). 


Canvas (https://canvas.io) is a phone-based 3D capture app that merges depth data from
the phone’s lidar sensor to produce complete textured 3D meshes.


Chapter 14: Image-based rendering 


As with other areas of computer vision, most recently published image-based rendering
and neural rendering papers now come with open-source implementations.


Appendix A: Linear algebra and numerical techniques 


The first edition of this book (Szeliski 2010, Appendix C.2) lists a number of widely
used linear algebra and non-linear least squares packages such as BLAS, LAPACK,
ATLAS, MKL, MINPACK, PARDISO, TAUCS, HSL, and ITSOL. Most of these
are now integrated into larger packages such as Python's NumPy and GPU machine
learning frameworks such as PyTorch and TensorFlow.


If you are interested in sparse linear least squares solvers, it is worth looking at SuiteS- 
parse (https://people.engr.tamu.edu/davis/suitesparse.html), since it contains a wide range 
of algorithms and associated publications (Davis 2006, 2011). 
Appendix B: Bayesian modeling and inference


The Middlebury benchmark for MRF minimization, https://vision.middlebury.edu/MRF/code,
contains implementations of basic MRF inference algorithms (Szeliski, Zabih et
al. 2008).


The OpenGM2 library and benchmarks for discrete factor graph models,
http://hciweb2.iwr.uni-heidelberg.de/opengm, contain a more extensive and up-to-date set of
algorithms (Kappes, Andres et al. 2015).


C.3 Slides and lectures 


While there are no official slide sets to go with this book, its content largely parallels that 
of the courses I have co-taught at the University of Washington,
https://www.cs.washington.edu/education/courses/cse576.

Related computer vision and deep learning classes include: 


Noah Snavely’s Introduction to Computer Vision class at Cornell Tech,
https://www.cs.cornell.edu/courses/cs5670/2021sp/


Alyosha Efros’ Intro to Computer Vision and Computational Photography class at
Berkeley, https://inst.eecs.berkeley.edu/~cs194-26/fa20.


David Fouhey’s and Justin Johnson’s Computer Vision class at the University of Michi- 
gan, https://web.eecs.umich.edu/~justincj/teaching/eecs442. 


Bill Freeman, Antonio Torralba, and Phillip Isola’s Advances in Computer Vision class 
at MIT http://6.869.csail.mit.edu/sp21. 


Justin Johnson’s Deep Learning for Computer Vision class at the University of Michigan,
https://web.eecs.umich.edu/~justincj/teaching/eecs498.


Yann LeCun and Alfredo Canziani’s Deep Learning class at NYU,
https://atcold.github.io/NYU-DLSP21.


UC Berkeley’s class on Deep Unsupervised Learning, 
https://sites.google.com/view/berkeley-cs294-158-sp20. 


You can find a more comprehensive list of such courses on the book’s web site,
https://szeliski.org/Book/default.htm#Slides.
There are also some great online lecture series, including:
The 2004 UW-MSR Course on Vision Algorithms,
http://www.cs.washington.edu/education/courses/cse577/04sp/index.htm.

The 2020-2021 TUM AI Guest Lecture Series, https://niessner.github.io/TUM-AI-Lecture-Series.

The 2020-2021 3DGV virtual seminar series on Geometry Processing and 3D Computer
Vision, https://3dgv.github.io.
Index


3D Rotations, see Rotations 

3D alignment, 513 
absolute orientation, 513, 821 
orthogonal Procrustes, 513 

3D convolutional neural networks, 317 

3D model capture, 854 

3D photography, 854, 872, 873 

3D scanning, 816 

3D video, 893 


Absolute orientation, 513, 821 
Activation functions, 272 
rectified linear unit (ReLU), 272 
sigmoid, 272 
Active appearance model (AAM), 366 
Active contours, 467 
Active illumination, 816 
Active rangefinding, 816 
Active shape model (ASM), 366, 471 
Active stereo, 819 
Activity recognition, 850 
Adadelta, 290 
AdaGrad, 289 
Adam optimization algorithm, 290 
Adaptive smoothing, 135 
Adversarial examples, 311, 403 
Affine transforms, 41, 45 
Affinities (segmentation), 489 


AlexNet neural network, 299 
Algebraic multigrid, 486 
Algorithms 
testing, viii 
Aliasing, 84, 616 
Alignment, see Image alignment 
Alpha 
opacity, 114 
pre-multiplied, 114 
Alpha matte, 114 
Ambient illumination, 71 
Analog to digital conversion (ADC), 83 
Anisotropic diffusion, 135 
Anisotropic filtering, 174 
Anti-aliasing filter, 85, 616 
Aperture, 75 
Aperture problem, 568 
Applications, 5 
3D model capture, 854 
3D model reconstruction, 725 
3D photography, 854, 872 
augmented reality, 697, 739 
automotive safety, 5 
background replacement, 777 
biometrics, 363 
colorization, 211 
digit classification, 298 
digital heritage, 824 
document scanning, 517

edge editing, 465 

facial animation, 839 

flash photography, 634 

frame interpolation, 593 

gaze correction, 769 

head tracking, 769 

hole filling, 665 

image search, 360 

industrial, 7 

intelligent photo editing, 394 
Internet photos, 725 

location recognition, 698 

machine inspection, 5 

match move, 723 

medical imaging, 5, 390, 577 
morphing, 177 

mosaic-based video compression, 522 
non-photorealistic rendering, 667 
Optical character recognition (OCR), 5 
panography, 506 
performance-driven animation, 454 
photo pop-up, 394 

Photo Tourism, 867 

Photomontage, 544 

planar pattern tracking, 697 

rolling shutter wobble removal, 587 
rotoscoping, 476 

scene completion, 394 

scratch removal, 665 

segmentation, 227 

self-driving vehicles, 5 

single view reconstruction, 688 
style transfer, 669 

synthetic re-focusing, 883 

tonal adjustment, 119 

video denoising, 589 

video stabilization, 573 


video summarization, 522 


video-based walkthroughs, 896 
view morphing, 714 
visual effects, 5 
whiteboard scanning, 517 
z-keying, 777 
Arc length parameterization of a curve, 463 
Architectural reconstruction, 833 
Area statistics, 141 
mean (centroid), 141 
perimeter, 141 
second moment (inertia), 141 
Aspect ratio, 57, 59 
Atrous convolution, 294 
Attention, 323 
Augmented reality, 550, 697, 739 
Auto-calibration, 712 
Autoencoder, 296, 329 
Automatic gain control (AGC), 82 
Average pooling, 295 


Axis/angle representation of rotations, 46 


B-snake, 469 
B-spline, 176, 469, 473, 477, 578 
cubic, 150 
multilevel, 826 
octree, 831 
Backbone network, 296 
Background plate, 662 
Background subtraction (maintenance), 844 
Backpropagation, 269, 284 
gradient checkpointing, 287 
guided, 308 
Backside illumination (back-illuminated) sensor, 
82 
Backward convolution, 295 
Bag of words (keypoints), 352 
distance metrics, 353 
Band-pass filter, 127 
Bartlett filter, see Bilinear kernel 


Barycentric coordinates, 194 




Batch channel normalization, 279 
Batch normalization, see Deep neural networks 
Bayer pattern (RGB sensor mosaic), 93 
demosaicing, 93, 646 
Bayes’ rule, 213, 948 
MAP (maximum a posteriori) estimate, 949 
posterior distribution, 948 
Bayesian classification, 243 
Bayesian modeling, 212, 948 
MAP estimate, 213, 949 
matting, 654 
posterior distribution, 213, 948 
prior distribution, 213, 948 
uncertainty, 213 
Belief propagation (BP), 219 
Benchmarks, 954 
Bias, 112, 560 
Bias-variance tradeoff, 202 
Bidirectional Reflectance Distribution Function, 
see BRDF 
Bilateral filter, 133, 184 
joint, 635 
range kernel, 134 
tone mapping, 630 
Bilateral solver, 210 
Bilinear blending, 118 
Bilinear kernel, 126 
Biometrics, 363 
Bipartite problem, 720 
Blind image deconvolution, 638 
Block-based motion estimation 
(block matching), 562 
Blocks world, 12 
Blue screen matting, 115, 180, 651 
Blur kernel, 75 
estimation, 616, 676 
Blur removal, 148, 183 
Body color, 70 
Boltzmann distribution, 214, 949 
Boosting, 374
decision stump, 374 
weak learner, 374 
Border (boundary) effects, 123, 182 
Boundary detection, 461 
Box filter, 125 
Boxlet, 130 
BRDF, 68
anisotropic, 68
isotropic, 68
recovery, 852
spatially varying (SVBRDF), 852
Brightness, 112 
Brightness constancy, 3, 558 
constraint, 558, 567, 580 
Bundle adjustment, 717 


C3D network, 318 
Calibration, see Camera calibration 
Calibration matrix, 56 
Camera calibration, 55, 105 
accuracy, 743 
aliasing, 616 
extrinsic, 56, 693 
intrinsic, 55, 685 
optical blur, 616, 676 
patterns, 685 
photometric, 610 
plumb-line method, 692, 745 
point spread function, 616, 676 
radial distortion, 691 
radiometric, 611, 621, 674 
rotational motion, 689, 743 
slant edge, 616 
vanishing points, 687 
vignetting, 615 
Camera matrix, 56, 59 
Catadioptric optics, 77 
Category-level recognition, 349 
bag of words, 352 
part-based, 354

surveys, 410 
CCD, 80 

blooming, 80 
Central difference, 126 
Chain rule 

as used in backpropagation, 284 
Chained transformations, 696, 718 
Chamfer matching, 139 


Channel-separated convolution, 294 


Characteristic function, 141, 474, 824, 831 


Characteristic polynomial, 924 

Cheirality, 703, 708

Cholesky factorization, 925 
algorithm, 926 
incomplete, 937 
sparse, 933 

Chromatic aberration, 77, 746 

Chromaticity coordinates, 90 

CIE L*a*b*, see Color 

CIE L*u*v*, see Color 

CIE XYZ, see Color 

Circle of confusion, 75 

CLAHE, see Histogram equalization 

Classification, 237, 239 
Bayesian, 243 

CLIP, 315, 403 

Closing, 138 

Clustering, 257 
agglomerative, 258, 485 
cluster analysis, 483, 494 
divisive, 258, 484 

CMOS, 80 

CNN stereo matching costs, 779 

Co-vector, 42 

Coefficient matrix, 208 


Collineation, 45
Color, 87
balance, 94, 104, 180
camera, 92
demosaicing, 93, 646
fringing, 646
hue, saturation, value (HSV), 97
L*a*b*, 90
L*u*v*, 91, 259, 487
primaries, 88
profile, 613
ratios, 98
RGB, 89
transform, 112
twist, 94, 113
XYZ, 89
YIQ, 96
YUV, 96
Color filter array (CFA), 93, 646
Color line model, 657
ColorChecker chart, 613
Colorization, 211
Compositing, 113, 178, 180
image stitching, 536
opacity, 114
over operator, 114
surface, 536
transparency, 114
Compression, 98
Computational photography, 607
active illumination, 636
flash and non-flash, 634
high dynamic range, 620
references, 610, 671
tone mapping, 627
Concentric mosaic, 523, 882
CONDENSATION, 472
Condition number, 934
Conditional batch normalization, 279
Conditional generative adversarial network, 333
Conditional random field (CRF), 222, 388, 771
dense, 225
fully connected, 225
Confidence calibration, 280 
Confusion matrix (table), 441 
Conic section, 37 
Conjugate gradient descent (CG), 934 

algorithm, 935 

non-linear, 936 

preconditioned, 936 
Connected components, 141 
Constellation model, 356 


Content-based image retrieval (CBIR), 360, 448 


Content-preserving warps, 533, 575 
Continuation method, 210 
Contour 

arc length parameterization, 463 

chain code, 463 

detection, 461 

matching, 464, 498 

smoothing, 464 
Contrast, 112 
Contrastive (metric) learning, 315 
Contrastive loss, 281, 282 
Controlled-continuity spline, 205 
Convolution, 120 

kernel, 120 

mask, 120 

superposition, 120 
Convolutional neural networks 


1 x 1 convolutions, 293 


Convolutional neural networks (CNNs), 291 


Coring, 157, 186 
Correlation, 120, 561 
windowed, 564 


Correspondence map, 571 


Cramér–Rao lower bound, 513, 569, 952


Cross-entropy loss, 249, 280 
multi-class, 249 
Cross-validation, 201 


Cube map 


Hough transform, 479 
image stitching, 536 
Curve 
arc length parameterization, 463 
evolution, 464 
matching, 464 
smoothing, 464 
Cylindrical coordinates, 523 


DALL-E, 331 
Data energy (term), 214, 949 
Data fitting 
robust, 202 
Data interpolation, 194 
Dataset augmentation, see Deep neural networks


Dataset bias, 312 
Datasets and test databases, 954 
Decimation, 153 
Decimation kernels 
bicubic, 154 
binomial, 153, 155 
QMF, 154 
windowed sinc, 153 
Decision theory, 240 
Decision trees and random forests, 254 
Deconvolution network, 308 
Deep learning, 237, 268 
courses, 336 
history, 336 
layers, 270 
surveys, 336 
textbooks, 336 
Deep neural networks, 268 
3D, 317 
3D point clouds and meshes, 320 
activation functions, 272 
adversarial examples, 311 
AlexNet, 299 
architectures, 299 
attention, 322 
backbone, 296
backbone (trunk), 312
backpropagation, 284
batch normalization, 276
bottleneck, 296
channels, 291
convolutional neural networks, 291
dataset augmentation, 275
deconvolution, 308
dropout, 276
efficient (mobile) networks, 303
feature maps, 291
fine-tuning, 312
fully convolutional, 296
generative adversarial networks (GANs), 331
GoogLeNet, 300
group normalization, 279
He initialization, 283
head (branches), 312
instance normalization, 278
layer normalization, 278
learning rate, 288, 339
loss functions, 280
LSTMs, 321
minibatch stochastic gradient descent, 288
model zoo, 304
momentum, 289
neural architecture search (NAS), 305
optimization, 287
pre-training, 312
recurrent networks (RNNs), 321
regularization, 274
ResNet, 302
sequence modeling, 321
size and efficiency, 305
solvers, 287
spatio-temporal, 321
stereo matching, 778
stochastic gradient descent (SGD), 287
training, 287
transformer, 322
U-Net, 298
VGG, 300
visualization, 307
weight initialization, 283
weight sharing, 293
weights, 270
Delaunay triangulation, 194
Demosaicing (Bayer), 93, 646
Denoising
image, 644
video, 589
Dense captioning, 405
Dense conditional random field (CRF), 225
Depth estimation
monocular, 796
Depth from defocus, 814
Depth map, see Disparity map
Depth of field, 75, 103
Depth recovery, see Stereo
deep networks, 778
multi-view, 781
Depthwise convolution, 294
DETR, 327
Di-chromatic reflection model, 73
Difference matting (keying), 115, 181, 652, 844
Difference of Gaussians (DoG), 158
Difference of low-pass (DOLP), 159
Diffuse reflection, 70
Diffusion
anisotropic, 135
Digital camera, 79
color, 92
color filter array (CFA), 93
compression, 98
Dilation, 138
Dimensionality reduction


non-linear embeddings, 265 
Direct current (DC), 100 
Direct linear transform (DLT), 693 
Direct sparse matrix techniques, 932 
Directional derivative, 127 
selectivity, 129 
Discrete cosine transform (DCT), 98, 147 
Discrete Fourier transform (DFT), 144 
Discriminant analysis 
Fisher, 247 
linear, 247 
quadratic, 247 
Discriminative models 
decision trees and random forests, 254 
deep neural networks, 268 
feedforward networks, 268, 269, 274 
logistic regression, 248 
support vector machines, 250 
Discriminative random field (DRF), 223 
Disparity, 54, 756 
Disparity map, 757 
geometric consistency, 784 
multiple, 784 
Disparity space image (DSI), 757 
generalized, 759 
Displaced frame difference (DFD), 558 
Displacement field, 176 
Distance from face space (DFFS), 263 
Distance in face space (DIFS), 263 
Distance map, see Distance transform 
Distance transform, 139 
Euclidean, 140 
image stitching, 540 
Manhattan (city block), 139 
signed, 140 
Domain (of a function), 111 
Domain adaptation, 313 
Domain scaling law, 172 


Downsampling, see Decimation 


Downstream task, 313 
Dropout, see Deep neural networks 
DSAC, see RANSAC 
Dynamic programming (DP), 774 
monotonicity, 775 
ordering constraint, 775 
scanline optimization, 775 
Dynamic snake, 471 


Dynamic texture, 891 


Edge detection, 456, 496 
boundary detection, 461 
Canny, 457 
chain code, 463 
color, 460 
Difference of Gaussian, 458 
edgel (edge element), 458 
hysteresis, 463 
Laplacian of Gaussian, 458 
linking, 461, 497 
marching cubes, 458 
scale selection, 459 
steerable filter, 458 
zero crossing, 458 

Eigenface, 262 

Eigenvalue decomposition, 262, 470, 922 

Eigenvector, 922 

Elastic deformations, 577 
image registration, 577 

Elastic nets, 468 

Elliptical weighted average (EWA), 173 

Empirical risk minimization, 240 

Energy functions 
regular, 217 
sub-modular, 217 

Energy-based models, 191, 204 

Environment map, 67, 880 

Environment matte, 883 

Epipolar constraint, 704 

Epipolar geometry, 704, 753 
pure rotation, 709
pure translation, 708
Epipolar line, 754
Epipolar plane, 754, 761
image (EPI), 782, 877
Epipolar volume, 877
Epipole, 705, 754
Erosion, 138
Error rates
accuracy (ACC), 443
false negative (FN), 441
false positive (FP), 441
positive predictive value (PPV), 443
precision, 443
recall, 443
ROC curve, 443
true negative (TN), 441
true positive (TP), 441
Errors-in-variable model, 528, 929
heteroscedastic, 930
ESAC, see RANSAC
Essential matrix, 704
5-point algorithm, 707
eight-point algorithm, 705
re-normalization, 706
seven-point algorithm, 706
twisted pair, 708
Estimation theory, 941
Euclidean transformation, 40, 44
Euler angles, 45
Expectation maximization (EM), 261
Exponential twist, 47
Exposure bracketing, 621
Exposure value (EV), 76, 611


F-number (stop), 75, 104
F-score, 381, 443
F-theta lens, 64
Face detection, 371
boosting, 374
cascade of classifiers, 375
clustering and PCA, 373
neural networks, 373
support vector machines, 374
Face modeling, 838
Face recognition, 363
active appearance model, 366
eigenface, 262
elastic bunch graph matching, 365
local binary patterns (LBP), 411
local feature analysis, 365
Face transfer, 888
Facial expression recognition, 365
Facial motion capture, 838, 843, 888
Factor graph, 214, 215, 736, 949
Factorization, 15, 715
missing data, 716
projective, 716
Fast Fourier transform (FFT), 144
Fast marching method (FMM), 474
Feature descriptor, 434, 495
bias and gain normalization, 435
GLOH, 436
patch, 435
PCA-SIFT, 436
performance (evaluation), 437
quantization, 352, 447
RootSIFT, 436
SIFT, 435
steerable filter, 437
Feature detection, 419, 422, 495
Adaptive non-maximal suppression, 426
affine invariance, 431
auto-correlation, 422
Förstner, 425
Harris, 425
Laplacian of Gaussian, 429
MSER, 432
region, 433


repeatability, 428 
rotation invariance, 430 
scale invariance, 428 
Feature maps in deep neural networks, 291 
Feature matching, 419, 441, 496 
densification, 448 
efficiency, 445 
error rates, 441 
hashing, 445 
indexing structure, 445 
k-d trees, 446 
locality sensitive hashing, 446 
nearest neighbor, 443 
strategy, 441 
verification, 447 
Feature tracking, 452, 496 
affine, 452 
learning, 453 
Feature tracks, 715, 725 
Feature-based alignment, 503 
2D, 503 
3D, 513 
iterative, 507 
Jacobian, 504 
least squares, 504 
match verification, 348 
RANSAC, 511 
robust, 510 
Field of Experts (FoE), 219 
Fill factor, 82 
Fill-in, 720, 932 
Filter 
adaptive, 135 
band-pass, 127 
bilateral, 133, 184 
directional derivative, 127 
edge-preserving, 133, 135 
guided, 136, 184 


Laplacian of Gaussian, 127 
median, 132
moving average, 125 
non-linear, 132 
separable, 124, 182 
steerable, 128, 183, 184 
Filter coefficients, 120 
Filter kernel, see Kernel 
Finding faces, see Face detection 
Fine-grained category recognition, 359 
Fine-tuning deep neural networks, 312 
Finite element analysis, 207 
stiffness matrix, 208 
Finite impulse response (FIR) filter, 120, 130 
Fisher discriminant analysis, 247 
Fisher information matrix, 505, 513, 943, 952 
Fisheye lens, 64 
Flash and non-flash merging, 634 
Flash matting, 661 
Flip-book animation, 549 
Flying spot scanner, 818 
Focal length, 57, 58, 74 
Focus, 75 
shape-from, 814, 857 
Focus of expansion (FOE), 708 
Form factor, 74 
Forward mapping, see Forward warping 
Forward warping, 169, 187 
Fourier transform, 142, 184 
discrete, 144 
magnitude (gain), 143 
phase (shift), 143 
power spectrum, 146 
two-dimensional, 146 
Fourier-based motion estimation, 563 
Frame interpolation, 593 
Free-viewpoint video, 893, 895 
Fully connected (FC) layer, 272 
Fully connected conditional random field (CRF), 225
Fully convolutional network, 296
Fundamental matrix, 711 
estimation, see Essential matrix 


Fundamental radiometric relation, 79 


Gain, 112, 560 
Gamma, 112 
Gamma correction, 94, 104 
Gap closing (image stitching), 520 
Garbage matte, 662 
Gated convolution, 294 
Gaussian kernel, 126 
Gaussian Markov random field (GMRF), 218, 
224, 639 
Gaussian mixture model, 468, 472 
color model, 653 
expectation maximization (EM), 261 
mixing coefficient, 261 
Gaussian mixture models, 259 
Gaussian pyramid, 155 
Gaussian scale mixtures (GSM), 219 
Gaussian mixture model
soft assignment, 261 
Gaze correction, 769 
Geman–McClure function, 559
Generalized cylinders, 12, 820, 826 
Generalized mean pooling (GeM), 295 
Generative adversarial networks (GANs), 331 
conditional, 333 
discriminator, 331 
generator, 331 
Generative models, 193, 212, 243, 328, 948 
probabilistic generative classification, 243 
Geodesic active contour, 474 
Geodesic distance (segmentation), 230 
Geometric image formation, 36 
Geometric lens aberrations, 76 
Geometric primitives, 36 
homogeneous coordinates, 36 
lines, 36, 38 
normal vectors, 36, 38
planes, 38 
points, 36, 37 
Geometric transformations 
2D, 40, 168 
3D, 43 
3D perspective, 45 
3D rotations, 45 
affine, 41, 45 
bilinear, 43 
calibration matrix, 56 
collineation, 45 
Euclidean, 40, 44 
forward warping, 169, 187 
hierarchy, 41 
homography, 41, 45, 62, 517 
inverse warping, 170 
perspective, 41 
projections, 51 
projective, 41 
rigid body, 40, 44 
scaled rotation, 41, 45 
similarity, 41, 45 
translation, 40, 44 
Geometry image, 828 
Gesture recognition, 843 
Gibbs distribution, 214, 949 
Gimbal lock, 45 
Gist (of a scene), 358, 394 
Global illumination, 73 
GoogLeNet neural network, 300
Gradient checkpointing, 287 
Gradient location-orientation histogram (GLOH), 
436 
Graduated non-convexity (GNC), 209 
Graph cuts 
MRF inference, 216 
normalized cuts, 489 


Graph-based segmentation, 486 




Graphical models, 212, 214, 355 
Grassfire transform, 140, 465, 540 
Ground control points, 514, 707 
Group normalization, 279 

Guided image filter, 136, 184 


Hammersley–Clifford theorem, 214, 949
Hand tracking, 846 
Harris corner detector, see Feature detection 
HDR imaging, see High dynamic range (HDR) 
imaging 
He initialization rule for neural network weights, 
283 
Head tracking, 769 
active appearance model (AAM), 366 
Helmholtz reciprocity, 68 
Hessian, 208, 426, 505, 508, 513, 567, 571, 928 
eigenvalues, 569 
image, 568, 581 
inverse, 513, 569 
local, 579, 580 
patch-based, 572 
rank-deficient, 724 
reduced motion, 720 
sparse, 720, 747, 932 
Heteroscedastic, 505, 930 
Hidden Markov model (HMM), 891 
Hierarchical motion estimation, 562 
High dynamic range (HDR) imaging, 620 
formats, 627 
tone mapping, 627 
video, 625 
Highest confidence first (HCF), 216 
Hilbert transform pair, 129 
Hinge loss, 253 
Hinton diagrams, 308 
Histogram equalization, 115, 181 
locally adaptive, 117, 182 
Histogram intersection, 353 
Histogram of oriented gradients (HOG), 376 


History of computer vision, 10 
Hole filling, 665 
Holistic 3D reconstruction, 836 
Homogeneous coordinates, 36, 703 
Homography, 41, 62, 517 
Hough transform, 478 
cascaded, 480 
cube map, 479 
generalized, 478 
Human activity recognition, 397 
Human body shape modeling, 847 
Human motion tracking, 843 
activity recognition, 850 
adaptive shape modeling, 847 
background subtraction, 844 
flow-based, 845 
initialization, 844 
kinematic models, 845 
particle filtering, 847 
probabilistic models, 847 
Hyper-Laplacian, 210, 218, 219 
Hyperlapse videos, 899 
Hyperparameters, 202, 290 


Ideal points, 36 
Ill-conditioned problems, 197 
Ill-posed (ill-conditioned) problems, 204 
Illusions, 3 
Image alignment 
feature-based, 503, 760 
intensity-based, 558 
Image analogies, 667 
Image blending 
feathering, 540 
GIST, 547 
gradient domain, 545 
image stitching, 538 
Poisson, 545 
pyramid, 165, 545 


Image center, 57 
Image classification, 349
Image compositing, see Compositing
Image compression, 98
Image decimation, 153
Image deconvolution, see Blur removal
Image denoising, 183, 644
Image filtering, see Filter
Image formation
geometric, 36
photometric, 66
Image gradient, 127, 136, 566
constraint, 207
Image interpolation, 150
Image matting, 650, 678
Image processing, 109
textbooks, 109, 178, 230
Image pyramid, 149, 184
Image quality assessment, 149
Image resampling, 168, 184
test images, 184
Image restoration, 148
blur removal, 148, 183
deblocking, 233
denoising, 148, 183
noise removal, 188
Image sensing, see Sensing
Image statistics, 141
Image stitching, 514
automatic, 533
bundle adjustment, 527
compositing, 536
coordinate transformations, 537
cube map, 536
cylindrical, 523, 551
deghosting, 532, 541, 552
exposure compensation, 547
feathering, 540
gap closing, 520
global alignment, 526
homography, 517
motion models, 516
panography, 506
parallax removal, 531
photogrammetry, 514
pixel selection, 538
planar perspective motion, 516
recognizing panoramas, 533
rotational motion, 519
seam selection, 541
spherical, 524
up vector selection, 529
Image Transformer, 326
Image warping, 168, 186, 563
Image-based modeling, 865
Image-based rendering, 596, 861
concentric mosaic, 882
environment matte, 883
impostors, 869
layered depth image, 868
layers, 869
light field, 875
Lumigraph, 875
modeling vs. rendering continuum, 886
multiplane image (MPI), 871
reflections, 872
sprites, 869
surface light field, 880
unstructured Lumigraph, 879
view interpolation, 863
view-dependent texture maps, 865
Image-based visual hull, 795
Image-to-image translation, 333
Implicit surface, 831
Impostors, see Sprites
Impulse response, 120
Inception module, 300
Incremental refinement
motion estimation, 562, 566


Incremental rotation, 50 
Indexing structure, 445 
Indicator function, 831 
Inductive spatial locality bias, 323 
Industrial applications, 7 
Infinite impulse response (IIR) filter, 130 
Influence function, 210, 510, 946 
Information criteria (BIC, AIC), 202 
Information matrix, 505, 513, 724, 943, 952 
Inpainting, 665 
Instance normalization, 278 
Instance recognition, 346 

geometric alignment, 348 

inverted index, 448 

large-scale, 448 

match verification, 348 

query expansion, 450 

stop list, 449 

visual words, 449 
Integrability constraint, 811 
Integral image, 129 
Integrating sphere, 611 
Intelligent scissors, 473 
Interaction potential, 214, 215, 949 
Interactive computer vision, 854 
Interactive segmentation, 227 
International Color Consortium (ICC), 613 
Internet photos, 725 
Interpolation, 150 

scattered data, 194 
Interpolation kernels 

bicubic, 152 

bilinear, 150 

binomial, 150 

sinc, 152 

spline, 152 
Intrinsic camera calibration, 685 
Intrinsic images, 12 
Inverse kinematics (IK), 845 
Inverse mapping, see Inverse warping

Inverse problems, 3, 204 

Inverse warping, 170 

ISO setting, 82 

Isomap, 265 

Iterated conditional modes (ICM), 216 

Iterative back projection (IBP), 638 

Iterative closest point (ICP), 468, 514, 821 

Iterative feature-based alignment, 507 

Iterative sparse matrix techniques, 934 
conjugate gradient, 934 

Iteratively reweighted least squares (IRLS), 202, 250, 510, 570, 696, 947


Jacobian, 504, 566, 696, 718, 930, 931 
image, 568 
motion, 571 
sparse, 720, 747, 932 

Joint bilateral filter, 635 


Joint domain (feature space), 489 


K-d trees, 446 
K-means, 259 
K-nearest neighbors (KNN), 241 
Kalman snakes, 471 
Kanade–Lucas–Tomasi (KLT) tracker, 453
Karhunen–Loève transform, 147, 262
Kernel, 125 

bilinear, 126 

Gaussian, 126 

low-pass, 126 

Sobel operator, 126 

unsharp mask, 126 
Kernel basis function, 206 
Kernel density estimation, 198 
Kernel functions, 196 
Kernel methods, 196 
Kernel regression, 196, 198, 252 


Keypoint detection, see Feature detection 
KinectFusion, 822
Kinematic model (chain), 845
Kruppa equations, 713


L*a*b*, see Color
L*u*v*, see Color
L1 norm, 210, 559, 581, 831
L∞ norm, 722
Lambertian reflection, 70
Laplacian Eigenmaps, 265
Laplacian matting, 659
Laplacian of Gaussian (LoG) filter, 127
Laplacian pyramid, 157
blending, 165, 185, 545
perfect reconstruction, 157
Lasso (least absolute shrinkage and selection operator), 197
Latent Dirichlet process (LDP), 357
Layered depth image (LDI), 868
Layered depth panorama, 882
Layered motion estimation, 589
reflections, 594
transparent, 594
Layers in image-based rendering, 869
Layout consistent random field, 387
Learning, 237
classification, 239
contrastive (metric), 315
deep neural networks, 268
nearest neighbors, 241
regression, 239
self-supervised, 312
semi-supervised, 266, 314
student-teacher, 316
supervised, 237, 239
test phase, 239
training phase, 239
unsupervised, 237, 257
weak, 314
weakly supervised, 268
Learning rate, see Deep neural networks
Least median of squares (LMS), 511
Least squares
iterative solvers, 695, 934
linear, 102, 504, 513, 558, 922, 927, 940, 944
non-linear, 507, 695, 703, 930, 945
robust, see Robust least squares
sparse, 719, 933
total, 929
weighted, 212, 505, 634, 637
LeNet-5, 291
Lens
compound, 77
nodal point, 77
thin, 74
Lens distortions, 63
calibration, 691
decentering, 64
radial, 63
spline-based, 65
tangential, 64
Lens law, 74
Level of detail (LOD), 827
Level sets, 474, 475
fast marching method, 474
geodesic active contour, 474
Levenberg–Marquardt, 508, 724, 748, 931, 967
Lidar (Light Detection and Ranging), 816
Lifting, see Wavelets
Light field
higher dimensional, 885
light slab, 877
ray space, 878
rendering, 875
surface, 880
Lightness, 91
Line at infinity, 36
Line detection, 477


Hough transform, 478 
RANSAC, 480 
simplification, 477, 499 
successive approximation, 477, 499 
Line equation, 36, 38 
Line fitting, 102 
uncertainty, 499 
Line hull, see Visual hull 
Line labeling, 12 
Line process, 231, 771, 949 
Line segment detector (LSD), 481 
Line spread function (LSF), 617 
Line support regions, 480 
Line-based structure from motion, 731 
Linear algebra, 919 
least squares, 927 
matrix decompositions, 920 
references, 920 
Linear blend, 112 
Linear discriminant analysis (LDA), 247 
Linear filtering, 119 
Linear operator, 112 
superposition, 112 
Linear shift invariant (LSI) filter, 122 
Live-wire, 473 
Local distance functions, 265 
Local Laplacian, 159, 186 
Local Linear Embedding (LLE), 265 
Local operator, 119 
Locality sensitive hashing (LSH), 446 
Localization, 698 
Locally adaptive histogram equalization, 118 
Location recognition, 698 
Log likelihood, 244 
Log odds, 245 
Logistic regression, 246, 248 
Logistic sigmoid function (curve), 245 
Logit, 245 
Long short-term memory (LSTM), 321 
Loopy belief propagation (LBP), 219
Loss function, 280 
ArcFace, 282 
contrastive, 281 
cross-entropy, 280 
perceptual, 281 
triplet, 282 
Low-pass filter, 126 
sinc, 126 
Lumigraph, 875 
unstructured, 879 
Luminance, 89 


Lumisphere, 880 


M-estimator, 202, 510, 558, 946 
Machine learning, see Learning 
textbooks, 336 
Machine learning models 
discriminative vs. generative, 248 
MAGSAC, see RANSAC 
Mahalanobis distance, 261, 264, 942 
Manhattan world, 732 
Manifold learning, 265 
Manifold mosaic, 541, 909 
Markov chain Monte Carlo (MCMC), 944 
Markov random field, 212, 949 
cliques, 215, 949 
directed edges, 229 
flux, 229 
inference, see MRF inference 
layout consistent, 387 
learning parameters, 213 
line process, 231, 771, 949 
neighborhood, 215, 949 
order, 215, 950 
random walker, 230 
stereo matching, 771 
Marr’s framework, 13 
computational theory, 13 


hardware implementation, 13 
representations and algorithms, 13
Masked convolution, 294 
Match move, 723 
Matrix decompositions, 920 

Cholesky, 925 

eigenvalue (ED), 922 

QR, 925 

singular value (SVD), 921 

square root, 925 
Matte reflection, 70 
Matting, 113, 115, 650, 678 

alpha matte, 114 

Bayesian, 654 

blue screen, 115, 180, 651 

difference, 115, 181, 652, 844 

flash, 661 

GrabCut, 657 

Laplacian, 658 

natural, 653 

optimization-based, 656 

Poisson, 657 

shadow, 661 

smoke, 661 

triangulation, 652, 662 

trimap, 653 

two screen, 652 

video, 662 
Max pooling, 295
Maximally stable extremal region (MSER), 432
Maximum a posteriori (MAP) estimate, 213, 949
Maximum margin classifier, 251
Mean absolute difference (MAD), 764
Mean average precision, 443
Mean shift, 487
Mean square error (MSE), 100, 764
Measurement equation (model), 702, 941
Measurement model, see Bayesian model
Medial axis transform (MAT), 140
Median absolute deviation (MAD), 559
Median filter, 132
Medical image registration, 577 
Medical image segmentation, 390 
Membrane, 205 
Mesh-based warping, 175, 186 
Metamer, 89 
Metric learning, 265 
Metric tree, 447 
Minibatch (stochastic gradient descent), 288
MIP-mapping, 172 

trilinear, 173 


Mixture of Gaussians, see Gaussian mixture model

MLESAC, see RANSAC 
Model selection, 202, 516, 949 
Model zoo, see Deep neural networks 
Model-based reconstruction, 833 

architecture, 833 

heads and faces, 838 

human body, 843 
Model-based stereo, 834, 866 
Models 

Bayesian, 212, 948 

forward, 3 

physically based, 15 

physics-based, 3 

probabilistic, 3 
Modulation transfer function (MTF), 86, 617 
Momentum, see Deep neural networks 
Monocular depth estimation, 796 
Morphable model 

body, 847 

face, 839, 888 

multidimensional, 888 
Morphing, 177, 187, 864, 865 

3D body, 847 

3D face, 840 

automated, 604 

facial feature, 888 
feature-based, 177, 187
flow-based, 604 
video textures, 891 
view morphing, 865, 911 
Morphological operator, 138 
Morphology, 138 
Mosaic, see Image stitching 
Mosaics 
motion models, 516 
video compression, 522 
whiteboard and document scanning, 517 
Motion compensated video compression, 562, 600 
Motion compensation, 100 
Motion estimation, 557 
affine, 570 
aperture problem, 568 
compositional, 572 
Fourier-based, 563 
frame interpolation, 593 
hierarchical, 562 
incremental refinement, 566 
layered, 589 
learning, 573, 581 
linear appearance variation, 569 
optical flow, 578 
parametric, 570 
patch-based, 558, 571 
phase correlation, 565 
quadtree spline-based, 577 
reflections, 595 
rolling shutter, 587 
spline-based, 575 
translational, 558 
transparent, 594 
uncertainty modeling, 569 
Motion field, 571 
Motion models 
learned, 573 


Motion segmentation, 605 
Motion stereo, 784
Moving least squares (MLS), 830 
MRF inference, 216, 950 

alpha expansion, 219 

belief propagation, 219 

expansion move, 219 

graph cuts, 216 

highest confidence first, 216 

iterated conditional modes, 216 

loopy belief propagation, 219 

simulated annealing, 216 

stochastic gradient descent, 216 

swap move (alpha-beta), 219 
Multi-frame motion estimation, 587 
Multi-layer perceptron (MLP), 272 
Multi-pass transforms, 174 
Multi-perspective panoramas, 523, 882 
Multi-perspective plane sweep (MPPS), 531 
Multi-view stereo, 781 

epipolar plane image, 782 

evaluation, 794 

initialization requirements, 794 

point clouds, 788 

reconstruction algorithm, 792 

scene representation, 788 

shape priors, 792 

silhouettes, 794 

space carving, 793 

spatio-temporally shiftable window, 783 

taxonomy, 787 

visibility, 791 

volumetric, 786, 789 

voxel coloring, 793 
Multidimensional scaling (MDS), 265 
Multigrid, 937 

algebraic (AMG), 486, 938 
Multinomial logistic regression objective, 249 
Multiplane image (MPI), 871 
Multiple hypothesis tracking, 472 




Multiple object tracking, 600 
Multiple-center-of-projection images, 523, 882, 
909 

Multiresolution representation, 154 

Mutual information, 561, 578 


Naive Bayes, 245 
Natural image matting, 653 
Nearest neighbor, 241 
distance ratio (NNDR), 444 
matching, see Feature matching 
Negative posterior log likelihood, 213, 943, 948 
Neighborhood operator, 119, 131 
Neural architecture search (NAS), 305 
Neural network 
backbone, 304 
backpropagation, 269 
branches, 304 


confidence calibration, 280 


fine tuning, 304 
for face detection, 373 
head(s), 304 
pre-trained, 304 
trunk, 304 
Neural network pooling 
average, 295 
generalized mean (GeM), 295 
max, 295 
unpooling, 295 
Neural rendering, 899 
depth images and layers, 902 
implicit functions using MLPs, 903 
texture mapped meshes and models, 900 
voxel grids, 903 
Neural textures, 902 
Nodal point, 77, 519 
Noise (sensor), 83, 614 
Noise level function (NLF), 83, 104, 614, 675 
Noise removal, see Denoising, 148, 188
Non-linear filter, 132, 179 
Non-linear least squares
see Least squares, 507
Non-maximal suppression, see Feature detection 
Non-parametric density modeling, 487 
Non-photorealistic rendering (NPR), 667 
Normal equations, 505, 567, 928, 931 
Normal map (geometry image), 828 
Normal vector, 38 
Normalized cross-correlation (NCC), 561, 602, 
764 
Normalized cuts, 489 
intervening contour, 491 
Normalized device coordinates (NDC), 54, 59 
Normalized exponential, 244 
Normalized sum of squared differences 
(NSSD), 561 
Norms 
L1, 210, 559, 581, 831
L∞, 722
Novel view synthesis (NVS), 864 
Nyquist rate/frequency, 85 


Object detection, 370 

car, 376, 411 

face, 371 

part-based, 377 

pedestrian, 369, 376 
Object tracking, 598 
Object-centered projection, 62 
Obstruction-free photography, 596 
Occluding contours, 760 
Octree reconstruction, 795 
Octree spline, 578 
Omnidirectional vision systems, 896
One-hot encoding, 249 
Opacity, 114 
OpenGV, 710, 966 
Opening, 138 
Operator linearity, 112 
Optic flow, see Optical flow 
Optical flow, 578
anisotropic smoothness, 581
benchmarks and datasets, 583
coarse-to-fine, 583
combinatorial optimization, 583
constraint equation, 567
deep learning, 584
evaluation, 583
fusion move, 583
global and local, 580
large displacement, 581
Markov random field, 583
multi-frame, 587
neural networks, 584
normal flow, 568
patch-based, 579
region-based, 592
regularization, 580
robust regularization, 581
smoothness, 580
task-oriented, 586
total variation, 581
variational, 581
Optical illusions, 3
Optical transfer function (OTF), 86, 616
Optical triangulation, 816
Optics, 74
chromatic aberration, 77
Seidel aberrations, 76
vignetting, 78, 676
Optimal motion estimation, 717
Oriented particles (points), 829
Orthogonal Procrustes, 513
Orthographic projection, 51
Osculating circle, 761
Over operator, 114
Overfitting, 199, 242
Overview, 22


Padding, 123, 182
Pairwise alignment, 503
Panography, 506, 549
Panorama, see Image stitching
Panorama with depth, 523, 759, 882
Para-perspective projection, 53
Parallel tracking and mapping (PTAM), 739
Parameter sensitive hashing, 446
Parametric motion estimation, 570
Parametric surface, 826
Parametric transformation, 168, 186
Part-based recognition, 354
constellation model, 356
Partial convolution, 294
Particle filtering, 472, 847, 944
Partition function, 249
Parzen window, 198
Patch-based motion estimation, 558
PatchMatch, 664
PatchMatch Stereo, 773
Peak signal-to-noise ratio (PSNR), 100, 148
Pedestrian detection, 376
Perceptual loss, 149, 281, 671
Perceptual similarity metrics, 148, 281
Performance-driven animation, 454, 842, 888
Perspective n-point problem (PnP), 694
Perspective projection, 53
Perspective transform (2D), 41
Phase correlation, 565, 602
Phong shading, 71
Photo pop-up, 394
Photo Tourism, 867
Photo-mosaic, 514
Photoconsistency, 757, 791
Photometric image formation, 66
calibration, 610
global illumination, 73
lighting, 66
optics, 74
radiosity, 73
reflectance, 67
shading, 71
Photometric stereo, 811 
Photometry, 66 
Photomontage, 544 
Physically based models, 15 
Physics-based vision, 16 
Pictorial structures, 12, 19, 354 
Pixel transform, 111 
Plücker coordinates, 39
Planar pattern tracking, 697 
Plane at infinity, 38 
Plane equation, 38 
Plane plus parallax, 60, 576, 591, 757, 870
Plane sweep, 757, 801 


Plane-based structure from motion, 733
Plenoptic function, 876
Plenoptic modeling, 865
Plumb-line calibration method, 692, 745
Point distribution model, 470
Point operator, 109
Point process, 109
Point spread function (PSF), 85
estimation, 616, 676
Point-based representations, 829
Points at infinity, 36
Poisson
blending, 545
equations, 831
matting, 657
noise, 83
surface reconstruction, 831
Polar coordinates, 37
Polar projection, 64, 526
Polyphase filter, 150
Pop-out effect, 4
Pose estimation, 693
iterative, 695
RGB-D data, 821
Power spectrum, 146
Pre-training deep neural networks, 312
Precision, see Error rates
mean average, 443
Preconditioning, 936
Pretext task, 313
Principal component analysis (PCA), 262, 373, 470, 923, 942
face modeling, 838
generalized, 924
missing data, 716, 924
Principal point, see Image center
Prior energy (term), 214, 949
Prior model, see Bayesian model
Probabilistic generative classification, 243
Profile curves, 760
Progressive mesh (PM), 827
Projections
object-centered, 62
orthographic, 51
para-perspective, 53
perspective, 53
Projective (uncalibrated) reconstruction, 710
Projective depth, 61, 757
Projective disparity, 61, 757
Projective space, 36
PROSAC (PROgressive SAmple Consensus), 511
PSNR, see Peak signal-to-noise ratio
Pull-push algorithm, 195
Pyramid, 149, 184
blending, 165, 185
Gaussian, 155
half-octave, 159
Laplacian, 157
motion estimation, 562
octave, 155
radial frequency implementation, 165
steerable, 165
Pyramid match kernel, 353


QR factorization, 925 
Quadratic discriminant analysis, 247 
Quadratic form, 208 
Quadrature mirror filter (QMF), 154 
Quadric equation, 37, 39 
Quadtree spline 
motion estimation, 577 
restricted, 577 
Quaternions, 48 
antipodal, 48 
multiplication, 49 
Query by image content (QBIC), 360 
Query expansion, 450 


Quincunx sampling, 159 


Radial basis function, 176, 196, 206, 826 


Radial distortion, 63 
barrel, 63 
calibration, 691 
parameters, 63 
pincushion, 63 
Radiance map, 624 
Radiometric image formation, 66 
Radiometric response function, 611 
Radiometry, 66 
Radiosity, 74 
Random forests, 254 
Random walker, 230 
Range (of a function), 111 
Range data, see Range scan 
Range image, see Range scan 
Range scan 
alignment, 821, 858 
large scenes, 824 
merging, 821 
registration, 821, 858 
segmentation, 821 


volumetric, 824 


Range sensing (rangefinding), 816 


coded pattern, 817 

light stripe, 816 

shadow stripe, 817, 858 

spacetime stereo, 820 

stereo, 819 

texture pattern (checkerboard), 818 
time of flight, 818 


Ranking loss, 281, 315 
RANSAC (RANdom SAmple Consensus), 480, 511, 707, 947
inliers, 511
preemptive, 511
progressive (PROSAC), 511
RAW image format, 83
Ray space (light field), 878
Ray tracing, 73
Rayleigh quotient, 490
Recall, see Error rates
Receiver Operating Characteristic
area under the curve (AUC), 443
mean average precision, 443
ROC curve, 443, 495
Recognition, 343
category (class), 349
color similarity, 360
context, 356
contour-based, 410
face, 363
instance, 346
part-based, 354
scene understanding, 356
semantic segmentation, 387
shape context, 410
Rectangle detection, 483
Rectification, 755, 800
standard rectified geometry, 756
Rectified linear unit (ReLU), 272
Recurrent neural networks (RNNs), 321
Recursive filter, 130 
Reference plane, 61 
Reflectance, 67 
Reflectance map, 809 
Reflectance modeling, 851 
Reflection 
di-chromatic, 73 
diffuse, 70 
specular, 71 
Reflection layers, 594, 872 
Region merging and splitting, 258, 485 
Registration, see Image Alignment 
feature-based, 503 
intensity-based, 558 
medical image, 577 
Regression, 194, 237, 239 
Regularization, 197, 204, 576 
neural network, 274 
robust, 209 
weight decay, 274 
Regularization parameter, 206 
Residual error, 202, 504, 510, 511, 558, 567, 571, 
580, 581, 702, 718, 927, 935 
Residual network (ResNet), 302 
RGB (red green blue), see Color 
Ridge regression, 197 
Rigid body transformation, 40, 44 
Risk minimization, 240 
RMSProp, 290 
Robust data fitting, 202 
Robust error metric, see Robust penalty function 
Robust least squares, 482, 510, 558, 946 
iteratively reweighted, 510, 570, 696, 947 
Robust loss function, 202 
Robust penalty function, 209, 558, 569, 638, 759, 
764, 765, 771, 947 
Robust regularization, 209 
Robust statistics, 559, 945 
inliers, 511
loss function, 202
M-estimator, 202, 510, 558, 946
Rodrigues’ formula, 47
Rolling shutter wobble removal, 587 
Root mean square error (RMS), 100, 560 
Rotations, 45 

Euler angles, 45 

axis/angle, 46 

exponential twist, 47 

incremental, 50 

interpolation, 50 

quaternions, 48 


Rodrigues’ formula, 47


Sampling, 84 
Scale invariant feature transform (SIFT), 435 
Scale-space, 14, 127, 158, 475 
Scatter matrix, 262 
Scattered data approximation, 194 
overfitting, 199 
underfitting, 199 
Scattered data interpolation, 176, 194 
Scene completion, 394 
Scene flow, 785, 893 
Scene understanding, 356 
gist, 358, 394 
Schur complement, 720, 932 
Scratch removal, 665 
Seam selection in image stitching, 541 
Second-order cone programming (SOCP), 722 
Seed and grow 
stereo, 760 
structure from motion, 726 
Segmentation 
active contours, 467 
affinities, 489 
binary MRF, 216, 227 
CONDENSATION, 472 


connected components, 141 
energy-based, 227

Gaussian mixture model, 259 

geodesic active contour, 474 

geodesic distance, 230 

GrabCut, 228, 657 

graph cuts, 227 

graph-based, 486 

hierarchical, 485, 486 

intelligent scissors, 473 

joint feature space, 489 

k-means, 259 

level sets, 474 

mean shift, 487 

medical image, 390 

merging, 258, 485 

minimum description length (MDL), 227 

Mumford–Shah, 227

non-parametric, 487 

normalized cuts, 489 

probabilistic aggregation, 486 

random walker, 230 

snakes, 467 

splitting, 258, 484 

stereo matching, 775 

thresholding, 138 

tobogganing, 474, 485 

watershed, 485 

weighted aggregation (SWA), 491 
Seidel aberrations, 76 
Self-attention, 324 
Self-calibration, 712 

bundle adjustment, 714 

Kruppa equations, 713 
Self-supervised learning, 312 
Semantic image synthesis, 333 
Semantic segmentation, 387 
Semi-global matching (SGM), 775, 779 
Semi-supervised learning, 266, 314 


transductive vs. inductive, 268 
Sensing, 79
aliasing, 84, 616 
color, 87 
color balance, 94 
gamma, 94 
pipeline, 80, 612 
sampling, 84 
sampling pitch, 81 
Sensor noise, 83, 614 
amplifier, 83 
dark current, 83 
fixed pattern, 83 
shot noise, 83 
Separable filtering, 124, 182 
Sequential minimal optimization (SMO), 253 
Shading, 71 
equation, 70 
shape-from, 809 
Shadow matting, 661 
Shape context, 410, 465 
Shape from 
focus, 814, 857 
photometric stereo, 811 
profiles, 760 
shading, 809 
silhouettes, 794 
specularities, 814 
stereo, 749 
texture, 814 
Shape parameters, 366, 470 
Shape-from-X, 14 
focus, 14 
photometric stereo, 14 
shading, 14 
texture, 14 
Shift invariance, 122 
Shiftable multi-scale transform, 165 
Shutter speed, 81 


Siamese network, 282 
Sigmoid activation function, 272
Sigmoid function, 245 
Signed distance function, 474, 821, 822, 829, 831 
Silhouette-based reconstruction, 794 
octree, 795 
visual hull, 794 
Similarity metrics 
perceptual, 148 
Similarity metrics (perceptual), 148 
Similarity transform, 41, 45 
Simulated annealing, 216 
Simultaneous localization and mapping (SLAM), 
734 
Sinc filter 
interpolation, 152 
low-pass, 126 
windowed, 152 
Single image depth estimation, 796
Single view metrology, 688, 744 
Singular value decomposition (SVD), 921 
Skeletal set, 722, 727 
Skeleton, 140, 465 
Skew, 56, 57 
Slant edge calibration, 616 
Slippery spring, 469 
SlowFast neural network architecture, 319 
Smoke matting, 661 
Smoothness constraint, 206 
Smoothness penalty, 206 
Snakes, 467 
ballooning, 467 
dynamic, 471 
internal energy, 467 
Kalman, 471 
shape priors, 470 
slippery spring, 469 
Soft assignment, 261 
Softmax function, 244, 249 
Software, 961 


Computer Vision: Algorithms and Applications, 2nd ed. (final draft, Sept. 2021) 


Space carving 
multi-view stereo, 793 
Spacetime stereo, 820 
Sparse flexible model, 355 
Sparse matrices, 932 
compressed sparse row (CSR), 932 
skyline storage, 932 
Sparse methods 
direct, 932 
iterative, 934 
Spatial pyramid matching, 353 
Spatially varying bidirectional reflectance distribution function (SVBRDF), 853
Spectral (weight) normalization, 279 
Spectral response function, 92 
Spectral sensitivity, 92 
Specular flow, 814 
Specular reflection, 71 
Spherical coordinates, 38, 479, 524 
Spherical linear interpolation, 50 
Spin image, 821 
Splatting, see Forward warping, 195 
volumetric, 829 
Spline, 195 
controlled continuity, 205 
octree, 578 
quadtree, 577 
tensor product, 195 
thin plate, 205 
Spline-based motion estimation, 575 
Splining images, see Laplacian pyramid blending 
Sprites 
image-based rendering, 869 
motion estimation, 589 
video, 891 
video compression, 522 
with depth, 870 
Statistical decision theory, 941, 944 


Statistical models: discriminative vs. generative, 248
Steerable filter, 128, 183, 184
Steerable pyramid, 165
Steerable random field, 218
Stereo, 749
aggregation methods, 767, 802
coarse-to-fine, 773
cooperative algorithms, 772
correspondence, 751
curve-based, 760
deep networks, 778
dense correspondence, 762
depth map, 751
dynamic programming, 774
edge-based, 760
epipolar geometry, 753
feature-based, 760
global optimization, 771, 802
graph cut, 772
layers, 777
local methods, 766
model-based, 834, 866
multi-view, 781
non-parametric similarity measures, 764
photoconsistency, 757
plane sweep, 757, 801
rectification, 755, 800
region-based, 766
scanline optimization, 775
seed and grow, 760
segmentation-based, 766, 775
semi-global matching (SGM), 775, 779
shiftable window, 783
similarity measure, 764
spacetime, 820
sparse correspondence, 760
sub-pixel refinement, 768
support region, 766
taxonomy, 753, 762
uncertainty, 769
window-based, 766, 802
winner-take-all (WTA), 768
Stereo-based head tracking, 769
Stiffness matrix, 208
Stitching, see Image stitching
Stochastic gradient descent (SGD), 216, 287
Strided convolution, 294
Structural Similarity (SSIM) index, 148
Structure from motion, 684
bas-relief ambiguity, 723, 724
bundle adjustment, 717
constrained, 731
factorization, 715
feature tracks, 725
iterative factorization, 716
line-based, 731
multi-frame, 715
orthographic, 715
plane-based, 717, 733
projective factorization, 716
seed and grow, 726
self-calibration, 712
skeletal set, 722, 727
two-frame, 703
uncertainty, 723
Student-teacher learning, 316
Style transfer, 669
Sub-modular energy functions, 217
Subdivision surface, 827
subdivision connectivity, 827
Subspace learning, 265
Sum of absolute differences (SAD), 558, 602, 764
Sum of squared differences (SSD), 558, 602, 764
bias and gain, 560
Fourier-based computation, 564
normalized, 561
surface, 424
weighted, 559
windowed, 559
Sum of sum of squared differences (SSSD), 781 
Summed area table, 129 
Super-resolution, 637, 677 
example-based, 639 
faces, 640 
hallucination, 639 
prior, 639 
video, 643 
Superposition principle, 112 
Superquadric, 831 
Supervised learning, 237, 239 
SuperVision neural network, 299 
Support vector machine (SVM), 250, 374, 377 
Support vectors, 252 
Surface element (surfel), 829 
Surface interpolation, 826 
Surface light field, 880 
Surface representations, 825 
non-parametric, 827 
parametric, 826 
point-based, 829 
simplification, 827 
splines, 827 
subdivision surface, 827 
symmetry-seeking, 826 
triangle mesh, 827
Surface simplification, 827

t-distributed Stochastic Neighbor Embedding (t-SNE), 265
Telecentric lens, 51, 815 
Temporal derivative, 567, 580 
Temporal texture, 891 
Testing algorithms, viii 
TextonBoost, 387 
Texture addressing mode, 124 
Texture map 
recovery, 850 
view-dependent, 851, 865 
Texture mapping
anisotropic filtering, 174 
MIP-mapping, 172 
multi-pass, 174 
trilinear interpolation, 173 
Texture synthesis, 663, 679 
by numbers, 668 
hole filling, 665 
image quilting, 664 
non-parametric, 664 
transfer, 667 
Texture, shape-from, 814 
Thin lens, 74 
Thin-plate spline, 205 
Thresholding, 138 
Through-the-lens camera control, 697, 723 
Tobogganing, 474, 485 
Tonal adjustment, 119, 181, 182 
Tone mapping, 627 
adaptive, 628 
bilateral filter, 630 
global, 627 
gradient domain, 630 
halos, 628 
interactive, 632 
local, 628 
scale selection, 632 
Total least squares (TLS), 570, 929 
Total variation, 210, 581, 831 
Tracking 
feature, 452 
head, 769 
human motion, 843 
multiple hypothesis, 472 
multiple object, 600 
object, 598 
planar pattern, 697 
PTAM, 739
Training error, 201
Transfer learning, 305, 313
Transformers, 322 
Translational motion estimation, 558 
bias and gain, 560 
Transparency, 114 
Transposed convolution, 295 
Travelling salesman problem (TSP), 468 
Tri-chromatic sensing, 88 
Tri-stimulus values, 88, 93 
Triangular irregular network (TIN), 194 
Triangulation, 701 
planar, 194 
Trilinear interpolation, see MIP-mapping 
Trimap (matting), 653 
Triplet loss, 282 
Truncated signed distance function (TSDF), 822, 829
Trust region method, 931
Two-dimensional Fourier transform, 146

U-Net, 298
Uncertainty 
correspondence, 505 
modeling, 512, 952 
weighting, 505 
Underfitting, 199, 242 
Unpooling, 295 
Unsharp mask, 126 
Unsupervised learning, 237, 257 
clustering, 257 
principal component analysis, 262
Upsampling, see Interpolation

Validation error, 201
Validation set, 201 
Vanishing point 
detection, 481, 500 
Hough, 482 
modeling, 834 
uncertainty, 500 
Variable reordering, 932
minimum degree, 932
multi-frontal, 932
nested dissection, 932
Variable state dimension filter (VSDF), 721
Variational autoencoder (VAE), 329
Variational method, 205
Variational methods, 204
VGG neural network, 300
Video compression
motion compensated, 562
Video compression (coding), 600
Video denoising, 589
Video matting, 662
Video object segmentation, 597
Video objects (coding), 589
Video segmentation, 597
Video sprites, 891
Video stabilization, 573, 603
Video super-resolution, 643
Video texture, 889
Video understanding, 396
Video-based animation, 888
Video-based rendering, 887
3D video, 893
animating pictures, 892
sprites, 891
video texture, 889
virtual viewpoint video, 893
walkthroughs, 896
View correlation, 723
View interpolation, 714, 863, 910
View morphing, 714, 865, 891
View-dependent texture maps, 865
Vignetting, 78, 560, 615, 676
mechanical, 79
natural, 78
Virtual viewpoint video, 893
Vision Transformer, 326
Visual hull, 794
image-based, 795
Visual illusions, 3
Visual localization, 698
Visual object tracking, 598
Visual odometry, 734
Visual place recognition, 698
Visual search, 360
Visual similarity (search), 360
Visual words, 352, 447, 449
Visual-inertial odometry, 736
Vocabulary tree, 447
Volumetric 3D reconstruction, 786
Volumetric performance capture, 895
Volumetric range image processing (VRIP), 822
Volumetric representations, 830
Voronoi diagram, 541
Voxel coloring multi-view stereo, 793
VQ-VAE, 330

Watershed, 485, 487
basins, 485, 487
oriented, 485
Wavelets, 159, 186
compression, 186
lifting, 162
overcomplete, 160, 165
second generation, 164
self-inverting, 165
tight frame, 160
weighted, 164
Weak learning, 314
Weakly supervised learning, 268
Weaving wall, 761
Weight decay, 197, 274
Weight initialization, see Deep neural networks
Weight sharing, 293
Weight standardization, 279
Weighted least squares (WLS), 212, 632
Weighted prediction (bias and gain), 560
White balance, 94, 104
Wiener filter, 146
Wire removal, 665
Wrapping mode, 124

XYZ, see Color

Zippering, 821


